Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date:
Msg-id: 47b86037-0e70-42a1-8656-9c0bd70ecd8c@vondra.me
In response to: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
List: pgsql-hackers
On 11/11/24 14:51, Ashutosh Bapat wrote:
> ...
>
> I think the problem is about processing an older running transactions
> record and setting data.restart_lsn based on the candidates such
> records produce. But what is not clear to me is how a newer
> candidate_restart_lsn is available immediately upon WAL sender
> restart. I.e. in the sequence of events you mentioned in your first
> email:
> 1. 344.139 LOG:  starting logical decoding for slot "s1"
> 
> 2. 344.139 DETAIL:  Streaming transactions committing after 1/E97EAC30,
>                     reading WAL from 1/E96FB4D0.
> 
> 3. 344.140 LOG:  logical decoding found consistent point at 1/E96FB4D0
> 
> 4. 344.140 DETAIL:  Logical decoding will begin using saved snapshot.
> 
> 5. 344.140 LOG:  LogicalConfirmReceivedLocation 1/E9865398
> 
> 6. 344.140 LOG:  LogicalConfirmReceivedLocation updating
>                  data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
>                  candidate_restart_valid 0/0 (from 1/E9865398)
>                  candidate_restart_lsn 0/0 (from 1/E979D4C8)
> 
> how were candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
> = 1/E9865398 set in the ReplicationSlot after the WAL sender restart? It
> means it must have read and processed the running transactions record at
> 1/E9865398. If that's true, how come it went back to a running
> transactions WAL record at 1/E979D4C8? It should be reading WAL records
> sequentially, hence read 1/E979D4C8 first and then 1/E9865398.
> 
>  344.145 LOG:  LogicalIncreaseRestartDecodingForSlot s1
>                    candidate_restart_valid_lsn 1/E979D4C8 (0/0)
>                    candidate_restart_lsn 1/E96FB4D0 (0/0)
> 

Those are good questions, but IIUC that's explained by this comment from
Masahiko-san's analysis [1]:

  Thinking about the root cause more, it seems to me that the root cause
  is not the fact that candidate_xxx values are not cleared when being
  released.

  In the scenario I reproduced, after restarting the logical decoding,
  the walsender sets the restart_lsn to a candidate_restart_lsn left in
  the slot upon receiving the ack from the subscriber. ...

If this is correct, then what happens is:

  1) replication is running, at some point we set candidate LSN to B

  2) something breaks, causing reconnect with restart LSN A (< B)

  3) we still have the candidate LSN B in memory, and after receiving
     some confirmation we set it as restart_lsn (see the sketch below)

  4) we get to decode the RUNNING_XACTS, which moves restart_lsn back

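To make step 3 concrete: the promotion happens in LogicalConfirmReceivedLocation()
once the confirmed flush location passes candidate_restart_valid. Roughly
(paraphrasing from memory, not quoting the source verbatim):

  /* paraphrased sketch of LogicalConfirmReceivedLocation(), not verbatim */
  SpinLockAcquire(&MyReplicationSlot->mutex);

  MyReplicationSlot->data.confirmed_flush = lsn;

  /*
   * If the confirmed location is past candidate_restart_valid, promote
   * whatever candidate_restart_lsn is currently in the slot -- even if
   * it's a leftover from before the reconnect.
   */
  if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
      MyReplicationSlot->candidate_restart_valid <= lsn)
  {
      MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
      MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
      MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
  }

  SpinLockRelease(&MyReplicationSlot->mutex);

Which is what the "LogicalConfirmReceivedLocation updating data.restart_lsn to
1/E979D4C8" message in step 6 of the log above shows: the first ack after the
restart is enough to move restart_lsn, even though decoding has only just
started again from the older restart_lsn.
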

If this analysis is correct, I think it's rather suspicious that we don't
reset the candidate fields on restart. Can those "old" values ever be
valid after a reconnect? That said, I haven't tried resetting them.

This is also why I'm not entirely sure that just tweaking the conditions in
LogicalIncreaseRestartDecodingForSlot is quite correct. Maybe it fixes this
particular issue, but perhaps the right fix is to reset the candidate fields
on reconnect, in which case tweaking the conditions would only be hiding the
actual problem. I haven't tried this either.
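
To illustrate what I mean by "reset the candidate fields on reconnect", a
hypothetical and untested sketch, e.g. somewhere around where the walsender
re-acquires the slot / creates the decoding context:

  /* untested sketch: drop any candidates left over from the previous
   * walsender, so the first confirmation can't promote a stale
   * candidate_restart_lsn into data.restart_lsn */
  SpinLockAcquire(&slot->mutex);
  slot->candidate_catalog_xmin = InvalidTransactionId;
  slot->candidate_xmin_lsn = InvalidXLogRecPtr;
  slot->candidate_restart_lsn = InvalidXLogRecPtr;
  slot->candidate_restart_valid = InvalidXLogRecPtr;
  SpinLockRelease(&slot->mutex);

Whether clearing the xmin candidates there is safe is of course the same
"can those old values ever be valid" question as above.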


[1]
https://www.postgresql.org/message-id/CAD21AoBVhYnGBuW_o%3DwEGgTp01qiHNAx1a14b1X9kFXmuBe%3Dsg%40mail.gmail.com


-- 
Tomas Vondra



