Re: Dropping publication breaks logical replication - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: Dropping publication breaks logical replication |
Date | |
Msg-id | CAExHW5tL4jp3RQbi-YWpwuv_e=L5iwJESqai1K5EKC1CcZQJBg@mail.gmail.com |
In response to | Re: Dropping publication breaks logical replication (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On Fri, Aug 1, 2025 at 4:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 1, 2025 at 10:54 AM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Hi Vignesh, Amit,
> > We encountered a situation where a customer dropped a publication
> > accidentally and that broke logical replication in an irrecoverable
> > manner. This is PG 15.3 but the team confirmed that the behaviour is
> > reproducible with PG 17 as well.
> >
> > When a WAL sender processes a WAL record recording a change in
> > publication, it ends up calling LoadPublication() which throws an
> > error if a publication mentioned in START_REPLICATION command is not
> > found. The downstream tries to reconnect but the WAL sender again
> > repeats the same process going in an error loop. Creating the
> > publication does not help since WAL sender will always encounter the
> > WAL record dropping the publication first.
> >
> > There are ways to come out of this situation, but not very clean always
> > 1. Remove publication from subscription, run logical replication till
> > it passes the point where publication was added, add the publication
> > back and continue. It's not always possible to know when the
> > publication was added back and thus it becomes tedious or next to
> > impossible to apply these steps.
> > 2. Reseeding the replication slot which involves copying all the data
> > again and not feasible in case of large databases.
> > 3. Skipping the transaction which dropped the publication. This will
> > work if drop publication was the only thing in that transaction but
> > not otherwise. Confirming that is tricky and requires some expert
> > help.
> >
> > In PG 18 onwards, this behaviour is fixed by throwing a WARNING
> > instead of an error. In the relevant thread [1] where the fix to PG 18
> > was discussed, backpatching was also discussed. Back then it was
> > deferred because of lack of field reports. But we are seeing this
> > situation now.
> >
>
> Thanks for the report. One more reason we were hesitant to backpatch
> was that it is possible that some users may expect replication to stop
> in this case as mentioned by Tomas in one of his emails [1] ("See the
> para starting with "Imagine you have a subscriber ..." in his email").
> We thought, as it could be perceived as a behavior change, so better
> to do it as a HEAD only change.

Yes, that's a valid concern. We have to choose between missing some
changes because of missing publication and an irrecoverable error. The
latter seems more serious. The first is covered by our documentation -
maybe indirectly and we throw a WARNING. So choosing the second seems a
better option. Maybe we could do a better job at documenting this. I
wish we could pass a "missing_ok" flag with START_REPLICATION command,
but we can't do that in the back branches. And we haven't done that
when we committed the fix to PG 18.

>
> Now, seeing this report, it seems the customer(s) are probably okay to
> skip a missing publication and let replication continue. So, we should
> consider backpatching this change but it would be better if few more
> people can share their opinion on this matter.

Including Tomas for his opinion. Who else do you think can provide an
opinion based on experience?

Thinking aloud about what you suggest in [1] in the same thread. The
problem there is, upstream can not access downstream subscription and
has no control over them so it can not avoid dropping a publication
even if it's being used by a subscription. What at most we can do is
not allow dropping a publication being used by a running WAL sender by
locking publication in use somehow. However, even that won't help
much. Assume that a WAL sender disconnects for some other reason,
followed by the publication getting dropped. We end up in the same
situation.

[1] https://www.postgresql.org/message-id/CAA4eK1K40xhObN1MWO7%3DrzrJmo%2BoQ048O8drZ86-F7artVvwQQ%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
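
To illustrate the error loop described in the message, here is a toy model in Python. It is only a sketch, not PostgreSQL code. The names `Record`, `decode` and `PublicationMissing` are hypothetical, and the `missing_ok` parameter stands in for the choice between raising an error (the behaviour reported for PG 15 and PG 17) and emitting a WARNING (the PG 18 behaviour). The model assumes the WAL sender sees the set of publications as it existed at each WAL position, which is why a publication recreated later in the WAL does not help.

```python
from dataclasses import dataclass


class PublicationMissing(Exception):
    """Raised when the requested publication is absent and missing_ok is False."""


@dataclass
class Record:
    kind: str       # "create_pub", "drop_pub", or "change"
    pub: str = ""   # publication name for create_pub / drop_pub records


def _apply_catalog(existing, rec):
    # Track which publications exist as of this WAL position.
    if rec.kind == "create_pub":
        existing.add(rec.pub)
    elif rec.kind == "drop_pub":
        existing.discard(rec.pub)


def decode(wal, start, requested_pub, missing_ok):
    """Decode wal[start:] and return (next_position, warnings, changes_sent).

    On error nothing is returned, so the caller's confirmed position stays
    at `start` and the next connection attempt replays the same records.
    """
    existing = set()
    for rec in wal[:start]:
        _apply_catalog(existing, rec)

    warnings, sent = [], 0
    for pos in range(start, len(wal)):
        rec = wal[pos]
        _apply_catalog(existing, rec)
        if rec.kind in ("create_pub", "drop_pub"):
            # A publication changed: re-check the requested publication.
            if requested_pub not in existing:
                msg = f'publication "{requested_pub}" does not exist at position {pos}'
                if not missing_ok:
                    raise PublicationMissing(msg)
                warnings.append(msg)
        elif rec.kind == "change" and requested_pub in existing:
            sent += 1
    return len(wal), warnings, sent


if __name__ == "__main__":
    wal = [
        Record("create_pub", "p"),
        Record("change"),
        Record("drop_pub", "p"),    # the accidental drop
        Record("change"),
        Record("create_pub", "p"),  # recreated later in the WAL
        Record("change"),
    ]
    for attempt in range(3):
        try:
            decode(wal, 0, "p", missing_ok=False)
        except PublicationMissing as exc:
            print(f"attempt {attempt}: ERROR: {exc}")
    print(decode(wal, 0, "p", missing_ok=True))
```

With `missing_ok=False`, every attempt fails at the same drop record even though the publication is created again further on. With `missing_ok=True`, decoding continues, and the change made while the publication was missing is not sent, which is the trade-off discussed in the thread.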