Hi Vignesh, Amit,
We encountered a situation where a customer accidentally dropped a
publication, and that broke logical replication irrecoverably. The
customer runs PG 15.3, but the team confirmed that the behaviour is
reproducible with PG 17 as well.
When a WAL sender processes a WAL record recording a change to a
publication, it ends up calling LoadPublications(), which throws an
error if a publication mentioned in the START_REPLICATION command is
not found. The downstream tries to reconnect, but the WAL sender
repeats the same process and ends up in an error loop. Re-creating the
publication does not help, since the WAL sender will always encounter
the WAL record dropping the publication first.
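
For illustration, a minimal reproduction sketch of the sequence
described above. The table, publication and subscription names and the
connection string are hypothetical placeholders, not taken from the
customer's setup, and the exact point at which the WAL sender fails may
differ in practice.

```sql
-- Publisher: hypothetical table and publication.
CREATE TABLE t1 (id int PRIMARY KEY);
CREATE PUBLICATION pub1 FOR TABLE t1;

-- Subscriber: matching table; the connection string is a placeholder.
CREATE TABLE t1 (id int PRIMARY KEY);
CREATE SUBSCRIPTION sub1
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub1;

-- Publisher: drop the publication, then make a change that gets decoded.
DROP PUBLICATION pub1;
INSERT INTO t1 VALUES (1);

-- Publisher: re-creating the publication afterwards does not help,
-- because the WAL from before this point is decoded first.
CREATE PUBLICATION pub1 FOR TABLE t1;
```
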
There are ways out of this situation, but they are not always clean:
1. Remove the publication from the subscription, run logical
replication until it passes the point where the publication was added
back, then add the publication to the subscription again and continue.
It's not always possible to know when the publication was added back,
so applying these steps is tedious or next to impossible.
2. Reseed the replication slot, which involves copying all the data
again and is not feasible for large databases.
3. Skip the transaction which dropped the publication. This works if
DROP PUBLICATION was the only thing in that transaction, but not
otherwise. Confirming that is tricky and requires some expert help.
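
As a rough sketch of option 1 on the subscriber side, assuming a
hypothetical subscription sub1 that subscribes to the dropped
publication pub1 along with at least one other publication (a
subscription cannot drop all of its publications). Whether
refresh = false is the right choice depends on the setup; it is an
assumption here.

```sql
-- Subscriber: stop requesting the missing publication.
ALTER SUBSCRIPTION sub1 DROP PUBLICATION pub1 WITH (refresh = false);

-- ... let replication advance past the point where pub1 was
-- added back on the publisher ...

-- Subscriber: request the publication again.
ALTER SUBSCRIPTION sub1 ADD PUBLICATION pub1 WITH (refresh = false);
```
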
From PG 18 onwards, this behaviour is fixed by throwing a WARNING
instead of an error. Backpatching was also discussed in the thread [1]
where the PG 18 fix was discussed. Back then it was deferred because of
a lack of field reports, but we are seeing this situation now, so maybe
it's time to backpatch the fix. Further, the PG 15 documentation for
CREATE SUBSCRIPTION
(https://www.postgresql.org/docs/15/sql-createsubscription.html)
mentions that a subscription may specify publications that do not
exist. So users will expect that their logical replication will not be
affected (except for the data published by the publication) if a
publication is dropped or does not exist. Backpatching the change would
therefore make the behaviour consistent with the documentation.
The backport seems to be straightforward. If we decide to backport the
fix, please let me know if you need my help in doing so.
--
Best Wishes,
Ashutosh Bapat