Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Msg-id CAD21AoBgn6-dehGxv1gbZegkssUpTA5fWGxv1_9GsSBBZH88rQ@mail.gmail.com
In response to Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
Hi,

Thank you for investigating this issue.

On Thu, Nov 7, 2024 at 10:40 AM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> I kept investigating this, but I haven't made much progress. I still
> don't understand why would it be OK to move any of the LSN fields
> backwards - certainly for fields like confirm_flush or restart_lsn.
>
> I did a simple experiment - added asserts to the couple places in
> logical.c updating the LSN fields, checking the value is increased.
> But then I simply ran make check-world, instead of the stress test.
>
> And that actually fails too, 040_standby_failover_slots_sync.pl triggers
> this
>
>     {
>         SpinLockAcquire(&MyReplicationSlot->mutex);
>         Assert(MyReplicationSlot->data.confirmed_flush <= lsn);
>         MyReplicationSlot->data.confirmed_flush = lsn;
>         SpinLockRelease(&MyReplicationSlot->mutex);
>     }
>
> So this moves confirm_flush back, albeit only by a tiny amount (I've
> seen ~56 byte difference). I don't have an example of this causing an
> issue in practice, but I note that CheckPointReplicationSlots does this:
>
>     if (is_shutdown && SlotIsLogical(s))
>     {
>         SpinLockAcquire(&s->mutex);
>
>         if (s->data.invalidated == RS_INVAL_NONE &&
>             s->data.confirmed_flush > s->last_saved_confirmed_flush)
>         {
>             s->just_dirtied = true;
>             s->dirty = true;
>         }
>         SpinLockRelease(&s->mutex);
>     }
>
> to determine if a slot needs to be flushed to disk during checkpoint. So
> I guess it's possible we save a slot to disk at some LSN, then the
> confirm_flush moves backward, and we fail to sync the slot to disk.
>
> But I don't have a reproducer for this ...
>
>
> I also noticed a strange difference between LogicalIncreaseXminForSlot
> and LogicalIncreaseRestartDecodingForSlot.
>
> The structure of LogicalIncreaseXminForSlot looks like this:
>
>     if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
>     {
>     }
>     else if (current_lsn <= slot->data.confirmed_flush)
>     {
>         ... update candidate fields ...
>     }
>     else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
>     {
>         ... update candidate fields ...
>     }
>
> while LogicalIncreaseRestartDecodingForSlot looks like this:
>
>     if (restart_lsn <= slot->data.restart_lsn)
>     {
>     }
>     else if (current_lsn <= slot->data.confirmed_flush)
>     {
>         ... update candidate fields ...
>     }
>
>     if (slot->candidate_restart_valid == InvalidXLogRecPtr)
>     {
>         ... update candidate fields ...
>     }
>
> Notice that LogicalIncreaseXminForSlot has the third block guarded by
> "else if", while LogicalIncreaseRestartDecodingForSlot has "if". Isn't
> that a bit suspicious, considering the functions do the same thing, just
> for different fields? I don't know if this is dangerous, the comments
> suggest it may just waste extra effort after reconnect.
>

I suspected this point as well. I still need to investigate whether it
is related to the issue, but I find this code in
LogicalIncreaseRestartDecodingForSlot() dangerous.

We update the slot's restart_lsn based on candidate_restart_lsn and
candidate_restart_valid upon receiving a feedback message from a
subscriber, and then clear both fields. Therefore, the unguarded
second if block in LogicalIncreaseRestartDecodingForSlot() can set an
arbitrary LSN as candidate_restart_lsn right after the slot's
restart_lsn has been updated.

I think an LSN older than the slot's restart_lsn can be passed to
LogicalIncreaseRestartDecodingForSlot() as restart_lsn, for example
after logical decoding restarts. The scenario I shared on another
thread [1] was: after updating the slot's restart_lsn (upon feedback
from a subscriber) based on the candidate_restart_lsn and
candidate_restart_valid values remaining in the slot, we might call
LogicalIncreaseRestartDecodingForSlot() while decoding a RUNNING_XACTS
record whose LSN is older than the slot's new restart_lsn. In that
case, we end up passing an LSN older than the new restart_lsn to
LogicalIncreaseRestartDecodingForSlot(), and that LSN is set as
candidate_restart_lsn. My hypothesis is that the first if block was
intended to prevent such cases:

    /* don't overwrite if have a newer restart lsn */
    if (restart_lsn <= slot->data.restart_lsn)
    {
    }

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoBG2OSDOFTtpPtQ7fx5Vt8p3dS5hPAv28CBSC6z2kHx-g%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


