Thread: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Vitaly Davydov"

Date:

31 October 2024, 13:18:34

Dear Hackers,

I'd like to discuss a problem with replication slots's restart LSN. Physical slots are saved to disk at the beginning of checkpoint. At the end of checkpoint, old WAL segments are recycled or removed from disk, if they are not kept by slot's restart_lsn values.

If an existing physical slot is advanced in the middle of checkpoint execution, WAL segments, which are related to saved on disk restart LSN may be removed. It is because the calculation of the replication slot miminal LSN is occured at the end of checkpoint, prior to old WAL segments removal. If to hard stop (pg_stl -m immediate) the postgres instance right after checkpoint and to restart it, the slot's restart_lsn may point to the removed WAL segment. I believe, such behaviour is not good.

The doc [0] describes that restart_lsn may be set to the some past value after reload. There is a discussion [1] on pghackers where such behaviour is discussed. The main reason of not flushing physical slots on advancing is a performance reason. I'm ok with such behaviour, except of that the corresponding WAL segments should not be removed.

I propose to keep WAL segments by saved on disk (flushed) restart_lsn of slots. Add a new field restart_lsn_flushed into ReplicationSlot structure. Copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It doesn't change the format of storing the slot contents on disk. I attached a patch. It is not yet complete, but demonstate a way to solve the problem.

I reproduced the problem by the following way:

Add some delay in CheckPointBuffers (pg_usleep) to emulate long checkpoint execution.
Execute checkpoint and pg_replication_slot_advance right after starting of the checkpoint from another connection.
Hard restart the server right after checkpoint completion.
After restart slot's restart_lsn may point to removed WAL segment.

The proposed patch fixes it.

[0] https://www.postgresql.org/docs/current/logicaldecoding-explanation.html
[1] https://www.postgresql.org/message-id/flat/059cc53a-8b14-653a-a24d-5f867503b0ee%40postgrespro.ru

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Vitaly Davydov"

Date:

20 November 2024, 16:40:45

Dear Hackers,

To ping the topic, I'd like to clarify what may be wrong with the idea described here, because I do not see any interest from the community. The topic is related to physical replication. The primary idea is to define the horizon of WAL segments (files) removal based on saved on disk restart LSN values. Now, the WAL segment removal horizon is calculated based on the current restart LSN values of slots, that can not be saved on disk at the time of the horizon calculation. The case take place when a slot is advancing during checkpoint as described earlier in the topic.

Such behaviour is not a problem when slots are used only for physical replication in a conventional way. But it may be a problem when physical slot is used for some other goals. For example, I have an extension which keeps the WAL using physical replication slots. It creates a new physical slot and advances it as needed. After restart, it can use restart lsn of the slot to read WAL from this LSN. In this case, there is no guarantee that restart lsn will point to an existing WAL segment.

The advantage of the current behaviour is that it requires a little bit less WAL to keep. The disadvantage is that physical slots do not guarantee WAL keeping starting from its' restart lsns in general.

I would be happy to get some advice, whether I am on the right or wrong way. Thank you in advance.

With best regards,

Vitaly

On Thursday, November 07, 2024 16:30 MSK, "Vitaly Davydov" <v.davydov@postgrespro.ru> wrote:

Dear Hackers,

I'd like to introduce an improved version of my patch (see the attached file). My original idea was to take into account saved on disk restart_lsn (slot→restart_lsn_flushed) for persistent slots when removing WAL segment files. It helps tackle errors like: ERROR: requested WAL segment 000...0AA has already been removed.

Improvements:
flushed_restart_lsn is used only for RS_PERSISTENT slots.
Save physical slot on disk when advancing only once - if restart_lsn_flushed is invalid. It is needed because slots with invalid restart LSN are not used when calculating oldest LSN for WAL truncation. Once restart_lsn becomes valid, it should be saved to disk immediately to update restart_lsn_flushed.
Regression tests seems to be ok except:
recovery/t/001_stream_rep.pl (checkpoint is needed)
recovery/t/019_replslot_limit.pl (it seems, slot was invalidated, some adjustments are needed)
pg_basebackup/t/020_pg_receivewal.pl (not sure about it)

There are some problems:
More WAL segments may be kept. It may lead to invalidations of slots in some tests (recovery/t/019_replslot_limit.pl). A couple of tests should be adjusted.

With best regards,
Vitaly Davydov

On Thursday, October 31, 2024 13:32 MSK, "Vitaly Davydov" <v.davydov@postgrespro.ru> wrote:

Sorry, attached the missed patch.

On Thursday, October 31, 2024 13:18 MSK, "Vitaly Davydov" <v.davydov@postgrespro.ru> wrote:

Dear Hackers,

I'd like to discuss a problem with replication slots's restart LSN. Physical slots are saved to disk at the beginning of checkpoint. At the end of checkpoint, old WAL segments are recycled or removed from disk, if they are not kept by slot's restart_lsn values.

If an existing physical slot is advanced in the middle of checkpoint execution, WAL segments, which are related to saved on disk restart LSN may be removed. It is because the calculation of the replication slot miminal LSN is occured at the end of checkpoint, prior to old WAL segments removal. If to hard stop (pg_stl -m immediate) the postgres instance right after checkpoint and to restart it, the slot's restart_lsn may point to the removed WAL segment. I believe, such behaviour is not good.

The doc [0] describes that restart_lsn may be set to the some past value after reload. There is a discussion [1] on pghackers where such behaviour is discussed. The main reason of not flushing physical slots on advancing is a performance reason. I'm ok with such behaviour, except of that the corresponding WAL segments should not be removed.

I propose to keep WAL segments by saved on disk (flushed) restart_lsn of slots. Add a new field restart_lsn_flushed into ReplicationSlot structure. Copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It doesn't change the format of storing the slot contents on disk. I attached a patch. It is not yet complete, but demonstate a way to solve the problem.

I reproduced the problem by the following way:
Add some delay in CheckPointBuffers (pg_usleep) to emulate long checkpoint execution.
Execute checkpoint and pg_replication_slot_advance right after starting of the checkpoint from another connection.
Hard restart the server right after checkpoint completion.
After restart slot's restart_lsn may point to removed WAL segment.
The proposed patch fixes it.

[0] https://www.postgresql.org/docs/current/logicaldecoding-explanation.html
[1] https://www.postgresql.org/message-id/flat/059cc53a-8b14-653a-a24d-5f867503b0ee%40postgrespro.ru

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Tomas Vondra

Date:

20 November 2024, 20:24:40

On 10/31/24 11:18, Vitaly Davydov wrote:
> Dear Hackers,
> 
>  
> 
> I'd like to discuss a problem with replication slots's restart LSN.
> Physical slots are saved to disk at the beginning of checkpoint. At the
> end of checkpoint, old WAL segments are recycled or removed from disk,
> if they are not kept by slot's restart_lsn values.
> 

I agree that if we can lose WAL still needed for a replication slot,
that is a bug. Retaining the WAL is the primary purpose of slots, and we
just fixed a similar issue for logical replication.

> 
> If an existing physical slot is advanced in the middle of checkpoint
> execution, WAL segments, which are related to saved on disk restart LSN
> may be removed. It is because the calculation of the replication slot
> miminal LSN is occured at the end of checkpoint, prior to old WAL
> segments removal. If to hard stop (pg_stl -m immediate) the postgres
> instance right after checkpoint and to restart it, the slot's
> restart_lsn may point to the removed WAL segment. I believe, such
> behaviour is not good.
> 

Not sure I 100% follow, but let me rephrase, just so that we're on the
same page. CreateCheckPoint() does this:

  ... something ...

  CheckPointGuts(checkPoint.redo, flags);

  ... something ...

  RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
                     checkPoint.ThisTimeLineID);

The slots get synced in CheckPointGuts(), so IIUC you're saying if the
slot gets advanced shortly after the sync, the RemoveOldXlogFiles() may
remove still-needed WAL, because we happen to consider a fresh
restart_lsn when calculating the logSegNo. Is that right?

> 
> The doc [0] describes that restart_lsn may be set to the some past value
> after reload. There is a discussion [1] on pghackers where such
> behaviour is discussed. The main reason of not flushing physical slots
> on advancing is a performance reason. I'm ok with such behaviour, except
> of that the corresponding WAL segments should not be removed.
> 

I don't know which part of [0] you refer to, but I guess you're
referring to this:

    The current position of each slot is persisted only at checkpoint,
    so in the case of a crash the slot may return to an earlier LSN,
    which will then cause recent changes to be sent again when the
    server restarts. Logical decoding clients are responsible for
    avoiding ill effects from handling the same message more than once.

Yes, it's fine if we discard the new in-memory restart_lsn value, and we
do this for performance reasons - flushing the slot on every advance
would be very expensive. I haven't read [1] as it's quite long, but I
guess that's what it says.

But we must not make any "permanent" actions based on the unflushed
value, I think. Like, we should not remove WAL segments, for example.

>  
> 
> I propose to keep WAL segments by saved on disk (flushed) restart_lsn of
> slots. Add a new field restart_lsn_flushed into ReplicationSlot
> structure. Copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It
> doesn't change the format of storing the slot contents on disk. I
> attached a patch. It is not yet complete, but demonstate a way to solve
> the problem.
> 

That seems like a possible fix this, yes. And maybe it's the right one.

> 
> I reproduced the problem by the following way:
> 
>   * Add some delay in CheckPointBuffers (pg_usleep) to emulate long
>     checkpoint execution.
>   * Execute checkpoint and pg_replication_slot_advance right after
>     starting of the checkpoint from another connection.
>   * Hard restart the server right after checkpoint completion.
>   * After restart slot's restart_lsn may point to removed WAL segment.
> 
> The proposed patch fixes it.
> 

I tried to reproduce the issue using a stress test (checkpoint+restart
in a loop), but so far without success :-(

Can you clarify where exactly you added the pg_usleep(), and how long
are the waits you added? I wonder if the sleep is really needed,
considering the checkpoints are spread anyway. Also, what you mean by
"hard reset"?

What confuses me a bit is that we update the restart_lsn (and call
ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
all the time. Walsender does that in PhysicalConfirmReceivedLocation for
example. So we actually see the required LSN to move during checkpoint
very often. So how come we don't see the issues much more often? Surely
I miss something important.

Another option might be that pg_replication_slot_advance() doesn't do
something it should be doing. For example, shouldn't be marking the slot
as dirty?

regards

-- 
Tomas Vondra

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Tomas Vondra

Date:

20 November 2024, 20:40:50


On 11/20/24 14:40, Vitaly Davydov wrote:
> Dear Hackers,
> 
>  
> 
> To ping the topic, I'd like to clarify what may be wrong with the idea
> described here, because I do not see any interest from the community.
> The topic is related to physical replication. The primary idea is to
> define the horizon of WAL segments (files) removal based on saved on
> disk restart LSN values. Now, the WAL segment removal horizon is
> calculated based on the current restart LSN values of slots, that can
> not be saved on disk at the time of the horizon calculation. The case
> take place when a slot is advancing during checkpoint as described
> earlier in the topic.
> 

Yeah, a simple way to fix this might be to make sure we don't use the
required LSN value set after CheckPointReplicationSlots() to remove WAL.
AFAICS the problem is KeepLogSeg() gets the new LSN value by:

   keep = XLogGetReplicationSlotMinimumLSN();

Let's say we get the LSN before calling CheckPointGuts(), and then pass
it to KeepLogSeg, so that it doesn't need to get the fresh value.

Wouldn't that fix the issue?

> 
> Such behaviour is not a problem when slots are used only for physical
> replication in a conventional way. But it may be a problem when physical
> slot is used for some other goals. For example, I have an extension
> which keeps the WAL using physical replication slots. It creates a new
> physical slot and advances it as needed. After restart, it can use
> restart lsn of the slot to read WAL from this LSN. In this case, there
> is no guarantee that restart lsn will point to an existing WAL segment.
> 

Yeah.

> 
> The advantage of the current behaviour is that it requires a little bit
> less WAL to keep. The disadvantage is that physical slots do not
> guarantee WAL keeping starting from its' restart lsns in general.
> 

If it's wrong, it doesn't really matter it has some advantages.


regards

-- 
Tomas Vondra

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Tomas Vondra

Date:

21 November 2024, 01:19:52

On 11/20/24 18:24, Tomas Vondra wrote:
>
> ...
>
> What confuses me a bit is that we update the restart_lsn (and call
> ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
> all the time. Walsender does that in PhysicalConfirmReceivedLocation for
> example. So we actually see the required LSN to move during checkpoint
> very often. So how come we don't see the issues much more often? Surely
> I miss something important.
> 

This question "How come we don't see this more often?" kept bugging me,
and the answer is actually pretty simple.

The restart_lsn can move backwards after a hard restart (for the reasons
explained), but physical replication does not actually rely on that. The
replica keeps track of the LSN it received (well, it uses the same LSN),
and on reconnect it sends the startpoint to the primary. And the primary
just proceeds use that instead of the (stale) restart LSN for the slot.
And the startpoint is guaranteed (I think) to be at least restart_lsn.

AFAICS this would work for pg_replication_slot_advance() too, that is if
you remember the last LSN the slot advanced to, it should be possible to
advance to it just fine. Of course, it requires a way to remember that
LSN, which for a replica is not an issue. But this just highlights we
can't rely on restart_lsn for this purpose.

(Apologies if this repeats something obvious, or something you already
said, Vitaly.)

regards

-- 
Tomas Vondra

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Tomas Vondra

Date:

21 November 2024, 16:59:27

On 11/20/24 23:19, Tomas Vondra wrote:
> On 11/20/24 18:24, Tomas Vondra wrote:
>>
>> ...
>>
>> What confuses me a bit is that we update the restart_lsn (and call
>> ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
>> all the time. Walsender does that in PhysicalConfirmReceivedLocation for
>> example. So we actually see the required LSN to move during checkpoint
>> very often. So how come we don't see the issues much more often? Surely
>> I miss something important.
>>
> 
> This question "How come we don't see this more often?" kept bugging me,
> and the answer is actually pretty simple.
> 
> The restart_lsn can move backwards after a hard restart (for the reasons
> explained), but physical replication does not actually rely on that. The
> replica keeps track of the LSN it received (well, it uses the same LSN),
> and on reconnect it sends the startpoint to the primary. And the primary
> just proceeds use that instead of the (stale) restart LSN for the slot.
> And the startpoint is guaranteed (I think) to be at least restart_lsn.
> 
> AFAICS this would work for pg_replication_slot_advance() too, that is if
> you remember the last LSN the slot advanced to, it should be possible to
> advance to it just fine. Of course, it requires a way to remember that
> LSN, which for a replica is not an issue. But this just highlights we
> can't rely on restart_lsn for this purpose.
> 

I kept thinking about this (sorry it's this incremental), particularly
if this applies to logical replication too. And AFAICS it does not, or
at least not to this extent.

For streaming, the subscriber sends the startpoint (just like physical
replication), so it should be protected too.

But then there's the SQL API - pg_logical_slot_get_changes(). And it
turns out it ends up syncing the slot to disk pretty often, because for
RUNNING_XACTS we call LogicalDecodingProcessRecord() + standby_decode(),
which ends up calling SaveSlotToDisk(). And at the end we call
LogicalConfirmReceivedLocation() for good measure, which saves the slot
too, just to be sure.

FWIW I suspect this still is not perfectly safe, because we may still
crash / restart before the updated data.restart_lsn makes it to disk,
but after it was already used to remove old WAL, although that's
probably harder to hit. With streaming the subscriber will still send us
the new startpoint, so that should not fail I think. But with the SQL
API we probably can get into the "segment already removed" issues.

I haven't tried reproducing this yet, I guess it should be possible
using the injection points. Not sure when I get to this, though.

In any case, doesn't this suggest SaveSlotToDisk() really is not that
expensive, if we do it pretty often for logical replication? Which was
presented as the main reason why pg_replication_slot_advance() doesn't
do that. Maybe it should?

If the advance is substantial I don't think it really matters, because
there simply can't be that many of large advances. It amortizes, in a
way. But even with smaller advances it should be fine, I think - if the
goal is to not remove WAL prematurely, it's enough to flush when we move
to the next segment.

regards

-- 
Tomas Vondra

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Vitaly Davydov"

Date:

21 November 2024, 17:56:05

Hi Tomas,

Thank you for the reply and your interest to the investigation.

On Wednesday, November 20, 2024 20:24 MSK, Tomas Vondra <tomas@vondra.me> wrote:

> If an existing physical slot is advanced in the middle of checkpoint
> execution, WAL segments, which are related to saved on disk restart LSN
> may be removed. It is because the calculation of the replication slot
> miminal LSN is occured at the end of checkpoint, prior to old WAL
> segments removal. If to hard stop (pg_stl -m immediate) the postgres
> instance right after checkpoint and to restart it, the slot's
> restart_lsn may point to the removed WAL segment. I believe, such
> behaviour is not good.
>

Not sure I 100% follow, but let me rephrase, just so that we're on the
same page. CreateCheckPoint() does this:

... something ...

CheckPointGuts(checkPoint.redo, flags);

... something ...

RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);

The slots get synced in CheckPointGuts(), so IIUC you're saying if the
slot gets advanced shortly after the sync, the RemoveOldXlogFiles() may
remove still-needed WAL, because we happen to consider a fresh
restart_lsn when calculating the logSegNo. Is that right?

They key action here is to restart the instance with -m immediate (or kill it and start it again) right after checkpoint. After restart, the slot's restart_lsn will be read from the disk and address to the removed WAL segment, if it's LSN was davanced enought to switch to a new WAL segment.

I tried to reproduce the issue using a stress test (checkpoint+restart
in a loop), but so far without success :-(

Can you clarify where exactly you added the pg_usleep(), and how long
are the waits you added? I wonder if the sleep is really needed,
considering the checkpoints are spread anyway. Also, what you mean by
"hard reset"?

I added pg_usleep as show below (in CheckPointBuffers function):

CheckPointBuffers(int flags)
{
BufferSync(flags);
+ pg_usleep(10000000);
}

Below is the instruction how I run my test (pg_usleep should be added to the code):

CONSOLE> initdb -D pgdata
CONSOLE> pg_ctl -D pgdata -l logfile start
... open two psql terminals and connect to the database (lets call them as PSQL-1, PSQL-2)
PSQL-1> select pg_create_physical_replication_slot('myslot', true, false);
CONSOLE> pgbench -i -s 10 postgres # create some WAL records
PSQL-1> checkpoint; -- press ENTER key and go to PSQL-2 console and execute next line in 1 second
PSQL-2> select pg_replication_slot_advance('myslot', pg_current_wal_lsn()); -- advance repslot during checkpoint
... wait for checkpoint to complete
CONSOLE> pg_ctl -D pgdata -m immediate stop
CONSOLE> pg_ctl -D pgdata start
PSQL-1> \c
PSQL-1> create extension pg_walinspect;
PSQL-1> select pg_get_wal_record_info(restart_lsn) from pg_replication_slots where slot_name = 'myslot';
ERROR: requested WAL segment pg_wal/000000010000000000000001 has already been removed

I'm trying to create a perl test to reproduce it. Please, give me some time to create the test script.

I kept thinking about this (sorry it's this incremental), particularly
if this applies to logical replication too. And AFAICS it does not, or
at least not to this extent.

Yes, it is not applied to logical replication, because logical slot is synced when advancing.

eah, a simple way to fix this might be to make sure we don't use the
required LSN value set after CheckPointReplicationSlots() to remove WAL.
AFAICS the problem is KeepLogSeg() gets the new LSN value by:

keep = XLogGetReplicationSlotMinimumLSN();

Let's say we get the LSN before calling CheckPointGuts(), and then pass
it to KeepLogSeg, so that it doesn't need to get the fresh value.

Yes, it is another solution and it can fix the problem. The question - which solution to choose. Well, I prefer to add a new in-memory state variable in the slot structure. Such variable may be useful if we want to check whether the slot data is synced or not. The calculation of the keep value before CheckPointGuts(), IMHO, requires to change signatures of a number of functions. I may prepare a new patch where your solution is implemented.

I'm sorry, if I missed to answer to some other questions. I will answer later.

With best regards,

Vitaly

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Vitaly Davydov"

Date:

03 March, 18:12:12

Dear Hackers,

Let me please introduce a new version of the patch.

Patch description:

The slot data is flushed to the disk at the beginning of checkpoint. If
an existing slot is advanced in the middle of checkpoint execution, its
advanced restart LSN is taken to calculate the oldest LSN for WAL
segments removal at the end of checkpoint. If the node is restarted just
after the checkpoint, the slots data will be read from the disk at
recovery with the oldest restart LSN which can refer to removed WAL
segments.

The patch introduces a new in-memory state for slots -
flushed_restart_lsn which is used to calculate the oldest LSN for WAL
segments removal. This state is updated every time with the current
restart_lsn at the moment, when the slot is saving to disk.

With best regards,
Vitaly

Dear All,

Thank you for the attention to the patch. I updated a patch with a better
solution for the master branch which can be easily backported to the other
branches as we agree on the final solution.

Two tests are introduced which are based on Tomas Vondra's test for logical slots
with injection points from the discussion (patches: [1], [2], [3]). Tests
are implemented as module tests in src/test/modules/test_replslot_required_lsn
directory. I slightly modified the original test for logical slots and created a
new simpler test for physical slots without any additional injection points.

I prepared a new solution (patch [4]) which is also based on Tomas Vondra's
proposal. With a fresh eye, I realized that it can fix the issue as well. It is
easy and less invasive to implement. The new solution differs from my original
solution: it is backward compatible (doesn't require any changes in ReplicationSlot
structure). My original solution can be backward compatible as well if to
allocate flushed_restart_lsn in a separate array in shmem, not in the
ReplicationSlot structure, but I believe the new solution is the better one. If
you still think that my previous solution is the better (I don't think so), I
will prepare a backward compatible patch with my previous solution.

I also proposed one more commit (patch [5]) which removes unnecessary calls of
ReplicationSlotsComputeRequiredLSN function which seems to be redundant. This
function updates the oldest required LSN for slots and it is called every time
when slots' restart_lsn is changed. Once, we use the oldest required LSN in
CreateCheckPoint/CreateRestartPoint to remove old WAL segments, I believe, there
is no need to calculate the oldest value immediately when the slot is advancing
and in other cases when restart_lsn is changed. It may affect on
GetWALAvailability function because the oldest required LSN will be not up to
date, but this function seems to be used in the system view
pg_get_replication_slots and doesn't affect the logic of old WAL segments
removal. I also have some doubts concerning advancing of logical replication
slots: the call of ReplicationSlotsComputeRequiredLSN was removed. Not sure, how
it can affect on ReplicationSlotsComputeRequiredXmin. This commit is not
necessary for the fix but I think it is worth to consider. It may be dropped or
applied only to the master branch.

This patch can be easily backported to the major release branches. I will
quickly prepare the patches for the major releases as we agree on the final
solution.

I apologize for such too late change in patch when it is already on commitfest.
I'm not well experienced yet in the internals of PostgreSQL at the moment,
sometimes the better solution needs some time to grow. In doing we learn :)

[1] 0001-Add-injection-points-to-test-replication-slot-advanc.v2.patch
[2] 0002-Add-TAP-test-to-check-logical-repl-slot-advance-duri.v2.patch
[3] 0003-Add-TAP-test-to-check-physical-repl-slot-advance-dur.v2.patch
[4] 0004-Keep-WAL-segments-by-slot-s-flushed-restart-LSN.v2.patch
[5] 0005-Remove-redundant-ReplicationSlotsComputeRequiredLSN-.v2.patch

With best regards,
Vitaly

Attachment

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Vitaly Davydov"

Date:

20 May, 18:44:44

Hi Alexander,

Thank you very much for the review!

> The patchset doesn't seem to build after 371f2db8b0, which adjusted
> the signature of the INJECTION_POINT() macro.  Could you, please,
> update the patchset accordingly.

I've updated the patch (see attached). Thanks.

> I see in 0004 patch we're calling XLogGetReplicationSlotMinimumLSN()
> before slots synchronization then use it for WAL truncation.
> Generally looks good.  But what about the "if
> (InvalidateObsoleteReplicationSlots(...))" branch?  It calls
> XLogGetReplicationSlotMinimumLSN() again.  Why would the value
> obtained from the latter call reflect slots as they are synchronized
> to the disk?

In patch 0004 I call XLogGetReplicationSlotMinimumLSN() again to keep the old
behaviour - this function was called in KeepLogSeg prior to my change. I also
call CheckPointReplicationSlots at the next line to save the invalidated and
other dirty slots on disk again to make sure, the new oldest LSN is in sync.

The problem I tried to solve in this if-branch is to fix test
src/test/recovery/t/019_replslot_limit.pl which was failed because the WAL was
not truncated enought for the test to pass ok. In general, this branch is not
necessary and we may fix the test by calling checkpoint twice (please, see the
alternative.rej patch for this case). If you think, we should incorporate this
new change, I'm ok to do it. But the WAL will be truncating more lazily.

Furthermore, I think we can save slots on disk right after invalidation, not in
CheckPointGuts to avoid saving invalidated slots twice.

With best regards,
Vitaly

Attachment

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Alexander Korotkov

Date:

24 May, 16:28:55

Hi, Amit!

Thank you for your attention to this patchset!

On Sat, May 24, 2025 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, May 24, 2025 at 4:08 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > I spend more time on this.  The next revision is attached.  It
> > contains revised comments and other cosmetic changes.  I'm going to
> > backpatch 0001 to all supported branches,
> >
>
> Is my understanding correct that we need 0001 because
> PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> changing the slot's restart_lsn?

Yes.  Also, even if it would save slot to the disk, there is still
race condition that concurrent checkpoint could use updated value from
the shared memory to clean old WAL segments, and then crash happens
before we managed to write the slot to the disk.

> If so, shouldn't the comments (One
> could argue that the slot should be saved to disk now, but that'd be
> energy wasted - the worst thing lost information could cause here is
> to give wrong information in a statistics view) in
> PhysicalConfirmReceivedLocation() be changed to mention the risk of
> not saving the slot?
>
> Also, after 0001, even the same solution will be true for logical
> slots as well, where we are already careful to save the slot to disk
> if its restart_lsn is changed, see LogicalConfirmReceivedLocation().
> So, won't that effort be wasted? Even if we don't want to do anything
> about it (which doesn't sound like a good idea), we should note that
> in comments somewhere.

I have added the comments about both points in the attached revision
of the patchset.

------
Regards,
Alexander Korotkov
Supabase

Attachment

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Amit Kapila

Date:

26 May, 09:49:00

On Sat, May 24, 2025 at 6:59 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi, Amit!
>
> Thank you for your attention to this patchset!
>
> On Sat, May 24, 2025 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, May 24, 2025 at 4:08 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > I spend more time on this.  The next revision is attached.  It
> > > contains revised comments and other cosmetic changes.  I'm going to
> > > backpatch 0001 to all supported branches,
> > >
> >
> > Is my understanding correct that we need 0001 because
> > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > changing the slot's restart_lsn?
>
> Yes.  Also, even if it would save slot to the disk, there is still
> race condition that concurrent checkpoint could use updated value from
> the shared memory to clean old WAL segments, and then crash happens
> before we managed to write the slot to the disk.
>

How can that happen, if we first write the updated value to disk and
then update the shared memory as we do in
LogicalConfirmReceivedLocation?

BTW, won't there be a similar problem with physical slot's xmin
computation as well? In PhysicalReplicationSlotNewXmin(), after
updating the slot's xmin computation, we mark it as dirty and update
shared memory values. Now, say after checkpointer writes these xmin
values to disk, walsender receives another feedback message and
updates the slot's xmin values. Now using these updated shared memory
values, vacuum removes the rows, however, a restart will show the
older xmin values in the slot, which mean vacuum would have removed
the required rows before restart.

As per my understanding, neither the xmin nor the LSN problem exists
for logical slots. I am pointing this out to indicate we may need to
think of a different solution for physical slots, if these are
problems only for physical slots.

--
With Regards,
Amit Kapila.

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Vitaly Davydov"

Date:

26 May, 13:22:51

Dear Alexander, Amit, All

> Amit wrote:
> > Is my understanding correct that we need 0001 because
> > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > changing the slot's restart_lsn?
>
> Yes.  Also, even if it would save slot to the disk, there is still
> race condition that concurrent checkpoint could use updated value from
> the shared memory to clean old WAL segments, and then crash happens
> before we managed to write the slot to the disk.
>
> How can that happen, if we first write the updated value to disk and
> then update the shared memory as we do in
> LogicalConfirmReceivedLocation?

I guess, that the problem with logical slots still exist. Please, see the tap
test: src/test/recovery/t/046_logical_slot.pl from the v6 version of the patch.
A race condition may happen when logical slot's restart_lsn was changed but not
yet written to the disk. Imagine, there is another physical slot which is
advanced at this moment. It recomputes oldest min LSN and takes into account
changed but not saved to disk restart_lsn of the logical slot. We come to the
situation when the WAL segment for the logical slot's restart_lsn may be
truncated after immediate restart.

I'm not sure what may happen with two checkpoints which execute in parallel, but
I would say that the patch 0001 guarantees that every checkpoint run will trim
the WAL segments based on the already saved on disk restart LSNs of slots. The
rule to trim the WAL by saved slot's restart_lsn will not be violated.

> Amit wrote:
> As per my understanding, neither the xmin nor the LSN problem exists
> for logical slots. I am pointing this out to indicate we may need to
> think of a different solution for physical slots, if these are
> problems only for physical slots.

As I've already told, it indirectly affects the logical slots as well.

> Alexander wrote:
> I spend more time on this.  The next revision is attached.  It
> contains revised comments and other cosmetic changes.  I'm going to
> backpatch 0001 to all supported branches, and 0002 to 17 where
> injection points were introduced.

Alexander, thank you for polishing the patch. Just my opinion, I would prefer
to put tests before the fix due to reason that you can reproduce the problem
when simply checkout the commit with tests. Once, the tests are after the fix
you are not able to do this way. Anyway, I'm ok with your changes. Thank you!

I did some changes in the patch (v7 is attached):
* Removed modules/test_replslot_required_lsn directory. It is not needed anymore,
once you've moved test files to another directory.
* Renamed tests to 046_checkpoint_logical_slot.pl, 047_checkpoint_physical_slot.pl.
I believe, such names are more descriptive.

Please, consider these changes.

With best regards,
Vitaly

Hello Alexander,

10.06.2025 23:14, Alexander Korotkov wrote:

So, my proposal is to commit the attached patchset to the HEAD, and
commit [1] to the back branches.  Any objections?

As the buildfarm animal prion shows [1], the 046_checkpoint_logical_slot
test fails with "-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE":
# poll_query_until timed out executing this query:
#
#         SELECT count(*) > 0 FROM pg_stat_activity
#         WHERE backend_type = 'client backend' AND wait_event = 'logical-replication-slot-advance-segment'
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[04:16:27] t/046_checkpoint_logical_slot.pl ......
Dubious, test returned 29 (wstat 7424, 0x1d00)
No subtests run
[04:20:58] t/047_checkpoint_physical_slot.pl ..... ok   271294 ms ( 0.00 usr 0.00 sys + 0.37 cusr 0.26 csys = 0.63 CPU)

I'm able to reproduce this locally as well. Though the test passes for me
with the increased timeout, that is it's not stuck:
PG_TEST_TIMEOUT_DEFAULT=360 PROVE_TESTS="t/046*" make -s check -C src/test/recovery/
# +++ tap check in src/test/recovery +++
t/046_checkpoint_logical_slot.pl .. ok
All tests successful.
Files=1, Tests=1, 533 wallclock secs ( 0.01 usr 0.00 sys + 4.70 cusr 9.61 csys = 14.32 CPU)
Result: PASS

Could you have a look?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-06-14%2001%3A58%3A06

Best regards,
Alexander

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Alexander Korotkov

Date:

15 June, 12:50:18

Hi, Alexander!

On Sun, Jun 15, 2025 at 12:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Alexander,
>
> 10.06.2025 23:14, Alexander Korotkov wrote:
>
> So, my proposal is to commit the attached patchset to the HEAD, and
> commit [1] to the back branches.  Any objections?
>
>
> As the buildfarm animal prion shows [1], the 046_checkpoint_logical_slot
> test fails with "-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE":
> # poll_query_until timed out executing this query:
> #
> #         SELECT count(*) > 0 FROM pg_stat_activity
> #         WHERE backend_type = 'client backend' AND wait_event = 'logical-replication-slot-advance-segment'
> #
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> [04:16:27] t/046_checkpoint_logical_slot.pl ......
> Dubious, test returned 29 (wstat 7424, 0x1d00)
> No subtests run
> [04:20:58] t/047_checkpoint_physical_slot.pl ..... ok   271294 ms ( 0.00 usr  0.00 sys +  0.37 cusr  0.26 csys =
0.63CPU) 
>
> I'm able to reproduce this locally as well. Though the test passes for me
> with the increased timeout, that is it's not stuck:
> PG_TEST_TIMEOUT_DEFAULT=360 PROVE_TESTS="t/046*" make -s check -C src/test/recovery/
> # +++ tap check in src/test/recovery +++
> t/046_checkpoint_logical_slot.pl .. ok
> All tests successful.
> Files=1, Tests=1, 533 wallclock secs ( 0.01 usr  0.00 sys +  4.70 cusr  9.61 csys = 14.32 CPU)
> Result: PASS
>
> Could you have a look?
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-06-14%2001%3A58%3A06

Hmm... It seems to take too long to advance the segment with these
options on.  Sure, I'll check this!

------
Regards,
Alexander Korotkov
Supabase

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Tom Lane

Date:

15 June, 19:05:56

BTW, while you're cleaning up this commit, could you remove the
excess newlines in some of the "note" commands in 046 and 047, like

note('starting checkpoint\n');

This produces bizarre output, as shown in the buildfarm logs:

[04:04:38.953](603.550s) # starting checkpoint\\n

            regards, tom lane

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Alexander Korotkov

Date:

16 June, 04:58:51

Hi, Tom!

On Sun, Jun 15, 2025 at 7:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, while you're cleaning up this commit, could you remove the
> excess newlines in some of the "note" commands in 046 and 047, like
>
> note('starting checkpoint\n');
>
> This produces bizarre output, as shown in the buildfarm logs:

Thank you for reporting this.  The revised patch is attached.  In
addition to reducing tests runtime, it removes excess newlines from
some note() calls.  The commit message is here.  I'm going to push
this if no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachment

v2-0001-Improve-runtime-and-output-of-tests-for-replicati.patch

RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Hayato Kuroda (Fujitsu)"

Date:

16 June, 12:11:27

Dear Alexander,

Thanks for pushing the fix patch! BTW, I have few comments for your commits.
Can you check and include them if needed?

01.
```
$node->append_conf('postgresql.conf',
    "shared_preload_libraries = 'injection_points'");
```

No need to set shared_preload_libraries in 046/047. ISTM it must be set when we
enable the statistics.

02.
We should also check whether the injection_points can be installed or not.
You can check check_extension() and callers.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Alexander Korotkov

Date:

16 June, 14:16:42

Dear Kuroda-san,

On Mon, Jun 16, 2025 at 12:11 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
> Thanks for pushing the fix patch! BTW, I have few comments for your commits.
> Can you check and include them if needed?
>
> 01.
> ```
> $node->append_conf('postgresql.conf',
>         "shared_preload_libraries = 'injection_points'");
> ```
>
> No need to set shared_preload_libraries in 046/047. ISTM it must be set when we
> enable the statistics.
>
> 02.
> We should also check whether the injection_points can be installed or not.
> You can check check_extension() and callers.

Thank you!  All of these totally make sense.  The updated patch is attached.

------
Regards,
Alexander Korotkov
Supabase

Attachment

v3-0001-Improve-runtime-and-output-of-tests-for-replicati.patch

RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Hayato Kuroda (Fujitsu)"

Date:

16 June, 14:44:44

Dear Alexander,

> Thank you!  All of these totally make sense.  The updated patch is attached.

Thanks for the update. I found another point.

```
-# Another 2M rows; that's about 260MB (~20 segments) worth of WAL.
+# Another 50K rows; that's about 86MB (~5 segments) worth of WAL.
 $node->safe_psql('postgres',
-    q{insert into t (b) select md5(i::text) from generate_series(1,1000000) s(i)}
+    q{insert into t (b) select repeat(md5(i::text),50) from generate_series(1,50000) s(i)}
 );
```

I think a perl function advance_wal() can be used instead of doing actual INSERT
command because no one refers the replicated result. Same thing can be said in
046/047.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Alexander Korotkov

Date:

17 June, 18:08:11

Hi, Vitaly!

On Tue, Jun 17, 2025 at 6:02 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> Thank you for reporting the issue.
>
> >While tracking buildfarm for one of other commits, I noticed this failure:
> >TRAP: failed Assert("s->data.restart_lsn >=
> >s->last_saved_restart_lsn"), File:
> >"../pgsql/src/backend/replication/slot.c", Line: 1813, PID: 3945797
> >postgres: standby: checkpointer (ExceptionalCondition+0x83) [0x55fa69b79f5e]
> >postgres: standby: checkpointer
> >(InvalidateObsoleteReplicationSlots+0x53c) [0x55fa69982171]
> >postgres: standby: checkpointer (CreateCheckPoint+0x9ad) [0x55fa6971feb2]
>
> This assert was introduced in the patch. Now, I think, it is a wrong one. Let me
> please explain one of the possible scenarios when it can be triggered. In case
> of physical replication, when walsender receives a standby reply message, it
> calls PhysicalConfirmReceivedLocation function which updates slots' restart_lsn
> from received flush_lsn value. This value may be older than the saved value. If
> it happens during checkpoint, after slot saving to disk, this assert will be
> triggered, because last_saved_restart_lsn value may be lesser than the new
> restart_lsn value, updated from walsender.
>
> I propose to remove this assert.

Yes, but is it OK for restart_lsn to move backward?  That might mean
that if checkpoint happen faster than
PhysicalConfirmReceivedLocation(), then crash could cause this WAL
location to be unavailable.  Is that true?

Also, what do you think about proposed changes in [1]?  I wonder if it
could somehow decrease the coverage.

Links.
1.
https://www.postgresql.org/message-id/OSCPR01MB149665B3F0629D10731B18E5AF570A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

------
Regards,
Alexander Korotkov
Supabase

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Tom Lane

Date:

17 June, 23:18:30

vignesh C <vignesh21@gmail.com> writes:
> While tracking buildfarm for one of other commits, I noticed this failure:
> TRAP: failed Assert("s->data.restart_lsn >=
> s->last_saved_restart_lsn"), File:
> "../pgsql/src/backend/replication/slot.c", Line: 1813, PID: 3945797

My animal mamba is also showing this assertion failure, but in a
different test (recovery/t/040_standby_failover_slots_sync.pl).
It's failed in two out of its three runs since ca307d5ce went in,
so it's more reproducible than scorpion's report, though still not
perfectly so.

I suspect that mamba is prone to this simply because it's slow,
although perhaps there's a different reason.  Anyway, happy to
investigate manually if there's something you'd like me to
check for.

            regards, tom lane

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

vignesh C

Date:

18 June, 13:03:17

On Wed, 18 Jun 2025 at 14:35, Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
>
> Dear Hayato,
>
> > To confirm, can you tell me the theory why the walsender received old LSN?
> > It is sent by the walreceiver, so is there a case that LogstreamResult.Flush can go backward?
> > Not sure we can accept the situation.
>
> I can't say anything about the origin of the issue, but it can be easily reproduced
> on the master branch:
>
> 1. Add an assert in PhysicalConfirmReceivedLocation (apply the attached patch)
> 2. Compile & install with tap tests and assertions enabled
> 3. cd src/bin/pg_basebackup/
> 3. PROVE_TESTS=t/020_pg_receivewal.pl gmake check

Thanks for the steps, I was able to reproduce the issue with the
suggested steps.

> The test will fail because of the assertion. I plan to investigate the issue
> but I need some more time for it. Once, it happens on the original master
> branch, I think, this problem already exists. The proposed patch seems
> to be not guilty.

This issue occurs even prior to this commit, I was able to reproduce
it on a version just before it. I’ll also look into analyzing the root
cause further.

> It may be the same problem as discussed in:
> https://www.postgresql.org/message-id/CALDaNm2uQbhEVJzvnja6rw7Q9AYu9FpVmET%3DTbwLjV3DcPRhLw%40mail.gmail.com

This issue was related to confirmed_flush and was addressed in commit
d1ffcc7fa3c54de8b2a677a3e503fc808c7b419c. It is not related to
restart_lsn.

Regards,
Vignesh

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Amit Kapila

Date:

19 June, 13:29:25

On Wed, Jun 18, 2025 at 10:17 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:
>
> On Wed, Jun 18, 2025 at 6:50 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> > > I think, it is a good idea. Once we do not use the generated data, it is ok
> > > just to generate WAL segments using the proposed function. I've tested this
> > > function. The tests worked as expected with and without the fix. The attached
> > > patch does the change.
> >
> > Sorry, forgot to attach the patch. It is created on the current master branch.
> > It may conflict with your corrections. I hope, it could be useful.
>
> Thank you.  I've integrated this into a patch to improve these tests.
>
> Regarding assertion failure, I've found that assert in
> PhysicalConfirmReceivedLocation() conflicts with restart_lsn
> previously set by ReplicationSlotReserveWal().  As I can see,
> ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn.
> So, it doesn't seems there is a guarantee that restart_lsn never goes
> backward.  The commit in ReplicationSlotReserveWal() even states there
> is a "chance that we have to retry".
>

I don't see how this theory can lead to a restart_lsn of a slot going
backwards. The retry mentioned there is just a retry to reserve the
slot's position again if the required WAL is already removed. Such a
retry can only get the position later than the previous restart_lsn.

>  Thus, I propose to remove the
> assertion introduced by ca307d5cec90.
>

If what I said above is correct, then the following part of the commit
message will be incorrect:
"As stated in the ReplicationSlotReserveWal() comment, this is not
always true. Additionally, this issue has been spotted by some
buildfarm
members."

--
With Regards,
Amit Kapila.

RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Hayato Kuroda (Fujitsu)"

Date:

19 June, 14:05:49

Dear Amit, Alexander,

> > Regarding assertion failure, I've found that assert in
> > PhysicalConfirmReceivedLocation() conflicts with restart_lsn
> > previously set by ReplicationSlotReserveWal().  As I can see,
> > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn.
> > So, it doesn't seems there is a guarantee that restart_lsn never goes
> > backward.  The commit in ReplicationSlotReserveWal() even states there
> > is a "chance that we have to retry".
> >
> 
> I don't see how this theory can lead to a restart_lsn of a slot going
> backwards. The retry mentioned there is just a retry to reserve the
> slot's position again if the required WAL is already removed. Such a
> retry can only get the position later than the previous restart_lsn.

We analyzed the assertion failure happened at pg_basebackup/020_pg_receivewal,
and confirmed that restart_lsn can go backward. This meant that Assert() added
by the ca307d5 can cause crash.

Background
===========
When pg_receivewal starts the replication and it uses the replication slot, it
sets as the beginning of the segment where restart_lsn exists, as the startpoint.
E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WALs
from 0/B00000.
More detail of this behavior, see f61e1dd2 and d9bae531.

What happened here
==================
Based on above theory, walsender sent from the beginning of segment (0/B00000).
When walreceiver receives, it tried to send reply. At that time the flushed
location of WAL would be 0/B00000. walsender sets the received lsn as restart_lsn
in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0->0/B00000).

The assertion failure could happen if CHECKPOINT happened at that time.
Attribute last_saved_restart_lsn of the slot was 0/B000D0, but the data.restart_lsn
was 0/B00000. It could not satisfy the assertion added in InvalidatePossiblyObsoleteSlot().

Note
====
1.
In this case, starting from the beginning of the segment is not a problem, because
the checkpoint process only removes WAL files from segments that precede the
restart_lsn's wal segment. The current segment (0/B00000) will not be removed,
so there is no risk of data loss or inconsistency.

2.
A similar pattern applies to pg_basebackup. Both use logic that adjusts the
requested streaming position to the start of the segment, and it replies the
received LSN as flushed.

3.
I considered the theory above, but I could not reproduce 040_standby_failover_slots_sync
because it is a timing issue. Have someone else reproduced?

We are still investigating failure caused at 040_standby_failover_slots_sync.

[1]:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=scorpion&dt=2025-06-17%2000%3A40%3A46&stg=pg_basebackup-check

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Alexander Korotkov

Date:

20 June, 03:18:20

On Thu, Jun 19, 2025 at 1:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Jun 18, 2025 at 10:17 PM Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
> >
> > On Wed, Jun 18, 2025 at 6:50 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> > > > I think, it is a good idea. Once we do not use the generated data, it is ok
> > > > just to generate WAL segments using the proposed function. I've tested this
> > > > function. The tests worked as expected with and without the fix. The attached
> > > > patch does the change.
> > >
> > > Sorry, forgot to attach the patch. It is created on the current master branch.
> > > It may conflict with your corrections. I hope, it could be useful.
> >
> > Thank you.  I've integrated this into a patch to improve these tests.
> >
> > Regarding assertion failure, I've found that assert in
> > PhysicalConfirmReceivedLocation() conflicts with restart_lsn
> > previously set by ReplicationSlotReserveWal().  As I can see,
> > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn.
> > So, it doesn't seems there is a guarantee that restart_lsn never goes
> > backward.  The commit in ReplicationSlotReserveWal() even states there
> > is a "chance that we have to retry".
> >
>
> I don't see how this theory can lead to a restart_lsn of a slot going
> backwards. The retry mentioned there is just a retry to reserve the
> slot's position again if the required WAL is already removed. Such a
> retry can only get the position later than the previous restart_lsn.

Yes, if retry is needed, then the new position must be later for sure.
What I mean is that ReplicationSlotReserveWal() can reserve something
later than what standby is going to read (and correspondingly report
with PhysicalConfirmReceivedLocation()).

> >  Thus, I propose to remove the
> > assertion introduced by ca307d5cec90.
> >
>
> If what I said above is correct, then the following part of the commit
> message will be incorrect:
> "As stated in the ReplicationSlotReserveWal() comment, this is not
> always true. Additionally, this issue has been spotted by some
> buildfarm
> members."

I agree, this comment needs improvement in terms of clarity.

Meanwhile I've pushed the patch for TAP tests, which I think didn't
get any objections.

------
Regards,
Alexander Korotkov
Supabase

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Alexander Korotkov

Date:

20 June, 03:24:16

Dear Kuroda-san,

On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
> > > Regarding assertion failure, I've found that assert in
> > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn
> > > previously set by ReplicationSlotReserveWal().  As I can see,
> > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn.
> > > So, it doesn't seems there is a guarantee that restart_lsn never goes
> > > backward.  The commit in ReplicationSlotReserveWal() even states there
> > > is a "chance that we have to retry".
> > >
> >
> > I don't see how this theory can lead to a restart_lsn of a slot going
> > backwards. The retry mentioned there is just a retry to reserve the
> > slot's position again if the required WAL is already removed. Such a
> > retry can only get the position later than the previous restart_lsn.
>
> We analyzed the assertion failure happened at pg_basebackup/020_pg_receivewal,
> and confirmed that restart_lsn can go backward. This meant that Assert() added
> by the ca307d5 can cause crash.
>
> Background
> ===========
> When pg_receivewal starts the replication and it uses the replication slot, it
> sets as the beginning of the segment where restart_lsn exists, as the startpoint.
> E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WALs
> from 0/B00000.
> More detail of this behavior, see f61e1dd2 and d9bae531.
>
> What happened here
> ==================
> Based on above theory, walsender sent from the beginning of segment (0/B00000).
> When walreceiver receives, it tried to send reply. At that time the flushed
> location of WAL would be 0/B00000. walsender sets the received lsn as restart_lsn
> in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0->0/B00000).
>
> The assertion failure could happen if CHECKPOINT happened at that time.
> Attribute last_saved_restart_lsn of the slot was 0/B000D0, but the data.restart_lsn
> was 0/B00000. It could not satisfy the assertion added in InvalidatePossiblyObsoleteSlot().

Thank you for your detailed explanation!

> Note
> ====
> 1.
> In this case, starting from the beginning of the segment is not a problem, because
> the checkpoint process only removes WAL files from segments that precede the
> restart_lsn's wal segment. The current segment (0/B00000) will not be removed,
> so there is no risk of data loss or inconsistency.
>
> 2.
> A similar pattern applies to pg_basebackup. Both use logic that adjusts the
> requested streaming position to the start of the segment, and it replies the
> received LSN as flushed.
>
> 3.
> I considered the theory above, but I could not reproduce 040_standby_failover_slots_sync
> because it is a timing issue. Have someone else reproduced?
>
> We are still investigating failure caused at 040_standby_failover_slots_sync.

I didn't manage to reproduce this.  But as I see from the logs [1] on
mamba that START_REPLICATION command was issued just before assert
trap.  Could it be something similar to what I described in [2].
Namely:
1. ReplicationSlotReserveWal() sets restart_lsn for the slot.
2. Concurrent checkpoint flushes that restart_lsn to the disk.
3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to
the beginning of the segment.

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mamba&dt=2025-06-17%2005%3A10%3A36&stg=recovery-check
[2] https://www.postgresql.org/message-id/CAPpHfdv3UEUBjsLhB_CwJT0xX9LmN6U%2B__myYopq4KcgvCSbTg%40mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

vignesh C

Date:

20 June, 15:41:57

On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Dear Kuroda-san,
>
> On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> > > > Regarding assertion failure, I've found that assert in
> > > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn
> > > > previously set by ReplicationSlotReserveWal().  As I can see,
> > > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn.
> > > > So, it doesn't seems there is a guarantee that restart_lsn never goes
> > > > backward.  The commit in ReplicationSlotReserveWal() even states there
> > > > is a "chance that we have to retry".
> > > >
> > >
> > > I don't see how this theory can lead to a restart_lsn of a slot going
> > > backwards. The retry mentioned there is just a retry to reserve the
> > > slot's position again if the required WAL is already removed. Such a
> > > retry can only get the position later than the previous restart_lsn.
> >
> > We analyzed the assertion failure happened at pg_basebackup/020_pg_receivewal,
> > and confirmed that restart_lsn can go backward. This meant that Assert() added
> > by the ca307d5 can cause crash.
> >
> > Background
> > ===========
> > When pg_receivewal starts the replication and it uses the replication slot, it
> > sets as the beginning of the segment where restart_lsn exists, as the startpoint.
> > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WALs
> > from 0/B00000.
> > More detail of this behavior, see f61e1dd2 and d9bae531.
> >
> > What happened here
> > ==================
> > Based on above theory, walsender sent from the beginning of segment (0/B00000).
> > When walreceiver receives, it tried to send reply. At that time the flushed
> > location of WAL would be 0/B00000. walsender sets the received lsn as restart_lsn
> > in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0->0/B00000).
> >
> > The assertion failure could happen if CHECKPOINT happened at that time.
> > Attribute last_saved_restart_lsn of the slot was 0/B000D0, but the data.restart_lsn
> > was 0/B00000. It could not satisfy the assertion added in InvalidatePossiblyObsoleteSlot().
>
> Thank you for your detailed explanation!
>
> > Note
> > ====
> > 1.
> > In this case, starting from the beginning of the segment is not a problem, because
> > the checkpoint process only removes WAL files from segments that precede the
> > restart_lsn's wal segment. The current segment (0/B00000) will not be removed,
> > so there is no risk of data loss or inconsistency.
> >
> > 2.
> > A similar pattern applies to pg_basebackup. Both use logic that adjusts the
> > requested streaming position to the start of the segment, and it replies the
> > received LSN as flushed.
> >
> > 3.
> > I considered the theory above, but I could not reproduce 040_standby_failover_slots_sync
> > because it is a timing issue. Have someone else reproduced?
> >
> > We are still investigating failure caused at 040_standby_failover_slots_sync.
>
> I didn't manage to reproduce this.  But as I see from the logs [1] on
> mamba that START_REPLICATION command was issued just before assert
> trap.  Could it be something similar to what I described in [2].
> Namely:
> 1. ReplicationSlotReserveWal() sets restart_lsn for the slot.
> 2. Concurrent checkpoint flushes that restart_lsn to the disk.
> 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to
> the beginning of the segment.

Here is my analysis for the 040_standby_failover_slots_sync test
failure where in physical replication slot can point to backward lsn:
In certain rare cases the restart LSN can go backwards. This scenario
can be reproduced by performing checkpoint continuously and slowing
the WAL applying. I have a patch with changes to handle this.

Here's a summary of the sequence of events:
1) Standby confirms a new LSN (0/40369C8) when primary sends some WAL contents:
After standby writes the received WAL contents in XLogWalRcvWrite, the
standby sends this lsn 0/40369C8 as the confirmed flush location. The
primary acknowledges this and updates the replication slot's
restart_lsn accordingly:
2025-06-20 14:33:21.777 IST [134998] standby1 LOG:
PhysicalConfirmReceivedLocation replication slot "sb1_slot" set
restart_lsn to 0/40369C8
2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT:
START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1
Checkpoint persists the new restart_lsn:

2) Shortly after receiving the new LSN, a checkpoint occurs which
saves this restart_lsn:
2025-06-20 14:33:21.780 IST [134913] LOG:  checkpoint complete: wrote
0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0
removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync
files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0
kB; lsn=0/4036A20, redo lsn=0/40369C8

3)Streaming replication is restarted because of primary_conninfo
change and reload
The WAL receiver process is restarted:
25-06-20 14:33:21.779 IST [134997] FATAL: terminating walreceiver
process due to administrator command

4) Standby sends an older flush pointer after restart:
Upon restart, the WAL receiver sends a flush location (0/401D448)
derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older
than the previously confirmed restart_lsn. It is important to note
that we are sending the lastReplayedEndRecPtr which is the last
successfully replayed lsn in this case:
5-06-20 14:33:21.796 IST [135135] LOG:  WalReceiverMain
LogstreamResult.Flush initialized to 0/401D448
2025-06-20 14:33:21.796 IST [135135] LOG:  sending write 0/401D448
flush 0/401D448 apply 0/401D448

This is done from here:
....
/* Initialize LogstreamResult and buffers for processing messages */
LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);

initStringInfo(&reply_message);

/* Initialize nap wakeup times. */
now = GetCurrentTimestamp();
for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
WalRcvComputeNextWakeup(i, now);

/* Send initial reply/feedback messages. */
XLogWalRcvSendReply(true, false);
...

In case of step 1, we are sending the lsn of the WAL that is written,
since we have slowed down the WAL applying the replay location is
lesser and the replay location is being sent here.

5) I have added logs to detect this inconsistency:
This leads to a scenario where the standby tries to confirm a
restart_lsn older than the one currently held by the primary:
2025-06-20 14:33:21.797 IST [135136] standby1 LOG:  crash scenario -
slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older
than the current one (0/40369C8)

If a checkpoint happens concurrently during this condition, it may
trigger an assertion failure on the primary due to the restart_lsn
being less than the last_saved_restart_lsn.
Currently this does not break physical replication, but I'm not sure
if the gap increases to many WAL files and if the WAL files get
deleted, how it will behave.
Attached the patch changes with which you can reproduce. Grep for
"crash scenario" in the logs. For me it occurs with every run. The
reproduced logs are attached.

This proves that the restart_lsn can go backward in cases where the
standby is slowly applying. But this has nothing to do with this
thread, I felt you can commit the assert removal patch. I will
continue the analysis further and see if there is any impact or not
and we can later add comments accordingly.

Regards,
Vignesh

On Wed, Jun 25, 2025 at 11:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Jun 25, 2025 at 1:18 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> > Another idea is to call ReplicationSlotsComputeRequiredLSN() when at least one
> > of the restart_lsn is updated, like attached. I feel this could reduce the computation
> > bit more.
>
> Right, that makes sense, if there is nothing updated on disk then we
> can avoid computing this.

Good idea.  But I think we should associate the "updated" flag
directly to the fact that one slot (no matter logical or physical)
changed its last_saved_restart_lsn.  See the attached patch.  I'm
going to push it if no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachment

v1-0001-Fix-CheckPointReplicationSlots-with-max_replicati.patch

RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Hayato Kuroda (Fujitsu)"

Date:

26 June, 06:46:44

Dear Alexander,
> 
> Good idea.  But I think we should associate the "updated" flag
> directly to the fact that one slot (no matter logical or physical)
> changed its last_saved_restart_lsn.  See the attached patch.  I'm
> going to push it if no objections.

+        /*
+         * Track if we're going to update slot's last_saved_restart_lsn.
+         * We need this to know if we need to recompute the required LSN.
+         */
+        if (s->last_saved_restart_lsn != s->data.restart_lsn)
+            last_saved_restart_lsn_updated = true;

I feel no need to set to true if last_saved_restart_lsn_updated is already true.
Other than that it's OK for me.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

"Hayato Kuroda (Fujitsu)"

Date:

27 June, 05:22:27

Dear Alexander,

> Regarding last_saved_restart_lsn_updated, I think the opposite.  I
> think we should check if last_saved_restart_lsn_updated is set already
> only if it could promise us some economy of resources.  In our case
> the main check only compares two fields of slot.  And that fields are
> to be accessed anyway.  So, we are not going to save any RAM accesses.
> Therefore, checking for last_saved_restart_lsn_updated seems like
> unnecessary code complication (and I don't see we're doing that in
> other places).  So, I'm going to push this patch "as is".

To clarify: I have no objections. Thanks for giving the knowledge.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From

Dilip Kumar

Date:

27 June, 07:17:19

On Thu, Jun 26, 2025 at 3:20 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Wed, Jun 25, 2025 at 11:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Wed, Jun 25, 2025 at 1:18 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > > Another idea is to call ReplicationSlotsComputeRequiredLSN() when at least one
> > > of the restart_lsn is updated, like attached. I feel this could reduce the computation
> > > bit more.
> >
> > Right, that makes sense, if there is nothing updated on disk then we
> > can avoid computing this.
>
> Good idea.  But I think we should associate the "updated" flag
> directly to the fact that one slot (no matter logical or physical)
> changed its last_saved_restart_lsn.  See the attached patch.  I'm
> going to push it if no objections.
>

Looks good to me.


--
Regards,
Dilip Kumar
Google