Thread: pg_createsubscriber: Fix incorrect handling of cleanup flags

pg_createsubscriber: Fix incorrect handling of cleanup flags

From
Nisha Moond
Date:
Hi Hackers,

In pg_createsubscriber, the flags 'made_publication' and
'made_replslot' are used to track whether the tool itself created a
publication or replication slot on the primary (publisher) node. If a
failure happens during the process, these flags help the tool decide
whether it should clean up those objects using
cleanup_objects_atexit().
However, there are cases where these flags are wrongly set to false
due to failures on the subscriber side, which causes the tool to skip
cleanup of some objects on the primary even when it should not.

Example: for made_publication
  In drop_publication(), if dropping a publication on the subscriber
fails (either a replicated publication or an existing one being
removed with --remove=publications), the made_publication flag is
wrongly set to false.
The process continues without exiting, but if a later step fails,
cleanup_objects_atexit() will see made_publication = false and skip
dropping the publication on the primary, even though it was created
earlier by the tool. This leads to orphaned publication.

Example: for made_replslot
  A similar issue exists for replication slots. In
drop_replication_slot(), if dropping a physical replication slot on
the primary, or a failover-synced slot on the subscriber, fails — the
made_replslot flag is set to false.
Again, if the process fails later, cleanup_objects_atexit() will
incorrectly skip dropping the logical replication slot created earlier
on the primary, leaving it behind.

Solution:
  The fix ensures that failures in dropping subscriber-side or
non-internal objects should not reset made_publication or
made_replslot.
These flags should only be reset if dropping the internally created
objects on the primary fails. That way, cleanup_objects_atexit() can
still correctly clean up what the tool created if something else goes
wrong later in the process.

Attached is the patch implementing the above proposed solution.
Reviews and feedback are most welcome.

--
Thanks,
Nisha

Attachment

Re: pg_createsubscriber: Fix incorrect handling of cleanup flags

From
"David G. Johnston"
Date:
On Sun, May 4, 2025 at 8:45 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
Attached is the patch implementing the above proposed solution.
Reviews and feedback are most welcome.

I feel like this is just papering over the issue - which is that these two drop functions are being used for multiple differently named publications/slots yet take no care to ensure they only change made_publication and made_repslot if the name of the object being passed in matches the name of the object those two booleans are specifically tracking (the application created objects on the publisher).

Make it so they are only changed to false if the name matches the one the program created and the connection is the primary connection.  That targets the real issue and avoids using a branching boolean parameter.

It seems really odd to say: if (in_cleanup) "don't try again" - since by definition this is the last thing we are doing before we exit.  So really what this patch can do more simply is just remove the dbinfo->made_replslot=false and *made_publication=false lines altogether - which might be a valid option.

I'm partial to the latter really, I don't think the error message output for retrying a drop that may have previously failed would be an issue.

David J.