Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 |
Date | |
Msg-id | 77ff7cb2-f820-42a3-82b5-8c585ee8afbf@vondra.me Whole thread Raw |
In response to | Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On 11/15/24 04:26, Amit Kapila wrote: > On Wed, Nov 13, 2024 at 5:23 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> On 11/13/24 11:59, Amit Kapila wrote: >>> On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat >>> <ashutosh.bapat.oss@gmail.com> wrote: >>>> >>>> On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> >>>>> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas@vondra.me> wrote: >>>>>> >>>>>> >>>>>> But neither of those fixes prevents backwards move for confirmed_flush >>>>>> LSN, as enforced by asserts in the 0005 patch. I don't know if this >>>>>> assert is incorrect or now. It seems natural that once we get a >>>>>> confirmation for some LSN, we can't move before that position, but I'm >>>>>> not sure about that. Maybe it's too strict. >>>>> >>>>> Hmm, I'm concerned that it might be another problem. I think there are >>>>> some cases where a subscriber sends a flush position older than slot's >>>>> confirmed_flush as a feedback message. >>>>> >>> >>> Right, it can happen for cases where subscribers doesn't have to do >>> anything (for example DDLs) like I have explained in one of my emails >>> [1] >>> >> >> Thanks. I admit not being entirely familiar with all the details, but >> doesn't that email explain more "Why it currently happens?" rather than >> "Is this what should be happening?" >> > > Right. > >> Sure, the subscriber needs to confirm changes for which nothing needs to >> be done, like DDL. But isn't there a better way to do that, rather than >> allowing confirmed_lsn to go backwards? >> > > Firstly, I agree that we should try to find ways to avoid going > confirmed_lsn going backward. We can try to explore solutions both in > the publisher and subscriber-side. > Good that we agree. I'm not sure how could we do this on the subscriber side. Or more precisely, I see "LSNs don't go backwards" as an invariant that the publisher can enforce for all subscribers (with whatever plugin). And if a subscriber responds in a way that contradicts the invariant, I'd treat that as a subscriber bug. I don't think we should rely on subscriber to do the right thing - we have little control over that, and when things break (say, WAL gets removed too early), people generally point at the publisher. >> >> I think what Ashutosh is saying that if confirmed_flush is allowed to >> move backwards, that may result in start_lsn moving backwards too. And >> we need to be able to decode all transactions committed since start_lsn, >> so if start_lsn moves backwards, maybe restart_lsn needs to move >> backwards too. I have no idea if confirmed_flush/start_lsn can move >> backwards enough to require restart_lsn to move, though. >> >> Anyway, these discussions are a good illustration why I think allowing >> these LSNs to move backwards is a problem. It either causes bugs (like >> with breaking replication slots) and/or it makes the reasoning about >> correct behavior much harder. >> > > Right, I think one needs to come up with some reproducible scenarios > where these cause any kind of problem or inefficiency. Then, we can > discuss the solutions accordingly. I mean to say that someone has to > put effort into making a bit more solid case for changing this code > because it may not be a good idea to change something just based on > some theory unless it is just adding some assertions. > I don't know, but isn't this a bit backwards? I understand we don't want to just recklessly change long-standing code, but I think it's sensible to enforce sensible invariants like "LSNs can't go backwards" to ensure correct behavior. And then if there's a performance/efficiency problem, and someone proposes a fix, it's up to that patch to prove it's correct. Here we seem to have code no one is quite sure is correct, but we're asking for reproducible scenarios proving the existence of a bug. The bugs can be quite subtle / hard to reproduce, as evidenced by the restart_lsn issue present since 9.4. Maybe this should be the other way around, i.e. make it "strict" and then prove that (a) relaxing the check is still correct and (b) it actually has other benefits. That being said, I don't have a clear idea how to change this. So +1 to at least introducing the asserts. regards -- Tomas Vondra
pgsql-hackers by date: