Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date
Msg-id 77ff7cb2-f820-42a3-82b5-8c585ee8afbf@vondra.me
Whole thread Raw
In response to Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On 11/15/24 04:26, Amit Kapila wrote:
> On Wed, Nov 13, 2024 at 5:23 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 11/13/24 11:59, Amit Kapila wrote:
>>> On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
>>> <ashutosh.bapat.oss@gmail.com> wrote:
>>>>
>>>> On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>
>>>>> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>>>>
>>>>>>
>>>>>> But neither of those fixes prevents backwards move for confirmed_flush
>>>>>> LSN, as enforced by asserts in the 0005 patch. I don't know if this
>>>>>> assert is incorrect or now. It seems natural that once we get a
>>>>>> confirmation for some LSN, we can't move before that position, but I'm
>>>>>> not sure about that. Maybe it's too strict.
>>>>>
>>>>> Hmm, I'm concerned that it might be another problem. I think there are
>>>>> some cases where a subscriber sends a flush position older than slot's
>>>>> confirmed_flush as a feedback message.
>>>>>
>>>
>>> Right, it can happen for cases where subscribers doesn't have to do
>>> anything (for example DDLs) like I have explained in one of my emails
>>> [1]
>>>
>>
>> Thanks. I admit not being entirely familiar with all the details, but
>> doesn't that email explain more "Why it currently happens?" rather than
>> "Is this what should be happening?"
>>
> 
> Right.
> 
>> Sure, the subscriber needs to confirm changes for which nothing needs to
>> be done, like DDL. But isn't there a better way to do that, rather than
>> allowing confirmed_lsn to go backwards?
>>
> 
> Firstly, I agree that we should try to find ways to avoid going
> confirmed_lsn going backward. We can try to explore solutions both in
> the publisher and subscriber-side.
> 

Good that we agree.

I'm not sure how could we do this on the subscriber side. Or more
precisely, I see "LSNs don't go backwards" as an invariant that the
publisher can enforce for all subscribers (with whatever plugin).

And if a subscriber responds in a way that contradicts the invariant,
I'd treat that as a subscriber bug. I don't think we should rely on
subscriber to do the right thing - we have little control over that, and
when things break (say, WAL gets removed too early), people generally
point at the publisher.

>>
>> I think what Ashutosh is saying that if confirmed_flush is allowed to
>> move backwards, that may result in start_lsn moving backwards too. And
>> we need to be able to decode all transactions committed since start_lsn,
>> so if start_lsn moves backwards, maybe restart_lsn needs to move
>> backwards too. I have no idea if confirmed_flush/start_lsn can move
>> backwards enough to require restart_lsn to move, though.
>>
>> Anyway, these discussions are a good illustration why I think allowing
>> these LSNs to move backwards is a problem. It either causes bugs (like
>> with breaking replication slots) and/or it makes the reasoning about
>> correct behavior much harder.
>>
> 
> Right, I think one needs to come up with some reproducible scenarios
> where these cause any kind of problem or inefficiency. Then, we can
> discuss the solutions accordingly. I mean to say that someone has to
> put effort into making a bit more solid case for changing this code
> because it may not be a good idea to change something just based on
> some theory unless it is just adding some assertions.
> 

I don't know, but isn't this a bit backwards? I understand we don't want
to just recklessly change long-standing code, but I think it's sensible
to enforce sensible invariants like "LSNs can't go backwards" to ensure
correct behavior. And then if there's a performance/efficiency problem,
and someone proposes a fix, it's up to that patch to prove it's correct.

Here we seem to have code no one is quite sure is correct, but we're
asking for reproducible scenarios proving the existence of a bug. The
bugs can be quite subtle / hard to reproduce, as evidenced by the
restart_lsn issue present since 9.4. Maybe this should be the other way
around, i.e. make it "strict" and then prove that (a) relaxing the check
is still correct and (b) it actually has other benefits.

That being said, I don't have a clear idea how to change this.

So +1 to at least introducing the asserts.


regards

-- 
Tomas Vondra




pgsql-hackers by date:

Previous
From: Pavan Deolasee
Date:
Subject: Re: Potential ABI breakage in upcoming minor releases
Next
From: "David E. Wheeler"
Date:
Subject: Re: Potential ABI breakage in upcoming minor releases