Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date:
Msg-id: 47b86037-0e70-42a1-8656-9c0bd70ecd8c@vondra.me
In response to: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
List: pgsql-hackers
On 11/11/24 14:51, Ashutosh Bapat wrote:
> ...
>
> I think the problem is about processing an older running transactions
> record and setting data.restart_lsn based on the candidates those
> records produce. But what is not clear to me is how come a newer
> candidate_restart_lsn is available immediately upon WAL sender
> restart. I.e. in the sequence of events you mentioned in your first
> email:
>
> 1. 344.139 LOG: starting logical decoding for slot "s1"
>
> 2. 344.139 DETAIL: Streaming transactions committing after 1/E97EAC30,
>    reading WAL from 1/E96FB4D0.
>
> 3. 344.140 LOG: logical decoding found consistent point at 1/E96FB4D0
>
> 4. 344.140 DETAIL: Logical decoding will begin using saved snapshot.
>
> 5. 344.140 LOG: LogicalConfirmReceivedLocation 1/E9865398
>
> 6. 344.140 LOG: LogicalConfirmReceivedLocation updating
>    data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
>    candidate_restart_valid 0/0 (from 1/E9865398)
>    candidate_restart_lsn 0/0 (from 1/E979D4C8)
>
> how did candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
> = 1/E9865398 get set in the ReplicationSlot after the WAL sender
> restart? It means it must have read and processed the running
> transactions record at 1/E9865398. If that's true, how come it went
> back to a running transactions WAL record at 1/E979D4C8? It should be
> reading WAL records sequentially, hence read 1/E979D4C8 first and then
> 1/E9865398.
>
> 344.145 LOG: LogicalIncreaseRestartDecodingForSlot s1
>     candidate_restart_valid_lsn 1/E979D4C8 (0/0)
>     candidate_restart_lsn 1/E96FB4D0 (0/0)

Those are good questions, but IIUC that's explained by this comment from
Masahiko-san's analysis [1]:

    Thinking about the root cause more, it seems to me that the root
    cause is not the fact that candidate_xxx values are not cleared when
    being released. In the scenario I reproduced, after restarting the
    logical decoding, the walsender sets the restart_lsn to a
    candidate_restart_lsn left in the slot upon receiving the ack from
    the subscriber. ...
If this is correct, then what happens is:

1) replication is running, and at some point we set the candidate LSN to B
2) something breaks, causing a reconnect with restart LSN A (< B)
3) we still have the candidate LSN B in memory, and after receiving some
   confirmation we set it as restart_lsn
4) we get to decode the RUNNING_XACTS, which moves restart_lsn back

If this analysis is correct, I think it's rather suspicious that we
don't reset the candidate fields on restart. Can those "old" values
ever be valid? But I haven't tried resetting them.

Also, this is why I'm not entirely sure just tweaking the conditions in
LogicalIncreaseRestartDecodingForSlot is quite correct. Maybe it fixes
this particular issue, but maybe the right fix would be to reset the
candidate fields on reconnect, and tweaking the conditions would just be
hiding the actual problem? I haven't tried this.

[1] https://www.postgresql.org/message-id/CAD21AoBVhYnGBuW_o%3DwEGgTp01qiHNAx1a14b1X9kFXmuBe%3Dsg%40mail.gmail.com

--
Tomas Vondra
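FWIW the sequence above can be sketched as a toy model. This is not
PostgreSQL code: the method names only loosely mirror
LogicalConfirmReceivedLocation / LogicalIncreaseRestartDecodingForSlot,
the update rules are deliberately simplified, and the LSN constants are
made up. It just shows how stale in-memory candidates plus re-decoding
older WAL can move restart_lsn backwards:

```python
# Hypothetical, simplified model of the ReplicationSlot candidate fields.
class Slot:
    def __init__(self, restart_lsn):
        self.restart_lsn = restart_lsn
        self.candidate_restart_lsn = None
        self.candidate_restart_valid = None

    def increase_restart_decoding(self, current_lsn, restart_lsn):
        # Loosely like LogicalIncreaseRestartDecodingForSlot: record a
        # candidate restart LSN if none is currently pending.
        if self.candidate_restart_lsn is None:
            self.candidate_restart_lsn = restart_lsn
            self.candidate_restart_valid = current_lsn

    def confirm_received(self, confirmed_lsn):
        # Loosely like LogicalConfirmReceivedLocation: once the confirmed
        # position passes candidate_restart_valid, adopt the candidate.
        if (self.candidate_restart_valid is not None
                and confirmed_lsn >= self.candidate_restart_valid):
            self.restart_lsn = self.candidate_restart_lsn
            self.candidate_restart_lsn = None
            self.candidate_restart_valid = None

slot = Slot(restart_lsn=0xA0)      # restart LSN A (made-up value)

# 1) before the restart, a candidate B (> A) was recorded in memory
slot.candidate_restart_lsn = 0xB0
slot.candidate_restart_valid = 0xC0

# 2) walsender restarts, reading WAL from A; candidates are NOT reset
# 3) a confirmation past candidate_restart_valid adopts the stale B
slot.confirm_received(0xC5)
assert slot.restart_lsn == 0xB0

# 4) decoding the old RUNNING_XACTS between A and B installs A as a new
#    candidate; the next confirmation moves restart_lsn backwards
slot.increase_restart_decoding(current_lsn=0xB0, restart_lsn=0xA0)
slot.confirm_received(0xC8)
assert slot.restart_lsn == 0xA0    # went backwards: 0xB0 -> 0xA0
```

In this model, clearing the candidate fields at step 2 would prevent
step 3 entirely, which is what makes the "reset on reconnect" fix seem
plausible.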