Thread: A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments
A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments
From
Bharath Rupireddy
Date:
Hi,
Currently postgres doesn't allow dropping a replication slot that's active [1]. This can make certain operations more time-consuming or stuck in production environments. These operations are - disable async/sync standbys and disable logical replication that require the postgres running on standby or the subscriber to go down. If stopping postgres server takes time, the VM or container will have to be killed forcefully which can take a considerable amount of time as there are many layers in between.
How about we provide a function to force-drop a replication slot? All other things such as stopping postgres and gracefully unprovisioning VM etc. can be taken care of in the background. This force-drop function will also have to ensure that the walsender that's active for the replication slot is terminated gracefully without letting postmaster restart the other backends (right now if a wal sender is exited/terminated, the postmaster restarts all other backends too). The main advantage of the force-drop function is that the disable operations can be quicker and there is no down time/crash on the primary/source server.
How about we provide a function to force-drop a replication slot? All other things such as stopping postgres and gracefully unprovisioning VM etc. can be taken care of in the background. This force-drop function will also have to ensure that the walsender that's active for the replication slot is terminated gracefully without letting postmaster restart the other backends (right now if a wal sender is exited/terminated, the postmaster restarts all other backends too). The main advantage of the force-drop function is that the disable operations can be quicker and there is no down time/crash on the primary/source server.
Thoughts?
[1] ERROR: replication slot "foo" is active for PID 2598155
Regards,
Bharath Rupireddy.Re: A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments
From
Tom Lane
Date:
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > How about we provide a function to force-drop a replication slot? Isn't this akin to filing off the safety interlock on the loaded revolver you keep in your hip pocket? IMO the entire point of replication slots is to not make it easy to lose data. regards, tom lane
Re: A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments
From
Amit Kapila
Date:
On Thu, Jun 9, 2022 at 11:07 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Currently postgres doesn't allow dropping a replication slot that's active [1]. This can make certain operations more time-consumingor stuck in production environments. These operations are - disable async/sync standbys and disable logicalreplication that require the postgres running on standby or the subscriber to go down. If stopping postgres servertakes time, the VM or container will have to be killed forcefully which can take a considerable amount of time as thereare many layers in between. > Why do you want to drop the slot when the server is going down? Is it some temporary replication slot, otherwise, how will you resume replication after restarting the server? -- With Regards, Amit Kapila.
Re: A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments
From
Bharath Rupireddy
Date:
On Thu, Jun 9, 2022 at 11:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > > How about we provide a function to force-drop a replication slot? > > Isn't this akin to filing off the safety interlock on the loaded revolver > you keep in your hip pocket? IMO the entire point of replication slots > is to not make it easy to lose data. Agree. How about making the function superuser-only? Regards, Bharath Rupireddy.
Re: A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments
From
Bharath Rupireddy
Date:
On Thu, Jun 9, 2022 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jun 9, 2022 at 11:07 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > Currently postgres doesn't allow dropping a replication slot that's active [1]. This can make certain operations moretime-consuming or stuck in production environments. These operations are - disable async/sync standbys and disable logicalreplication that require the postgres running on standby or the subscriber to go down. If stopping postgres servertakes time, the VM or container will have to be killed forcefully which can take a considerable amount of time as thereare many layers in between. > > > > Why do you want to drop the slot when the server is going down? Is it > some temporary replication slot, otherwise, how will you resume > replication after restarting the server? The setup is this - primary, bunch of sync standbys, bunch of read replicas (async standbys), bunch of logical replication subscribers - now, the user wants to remove any of them for whatever reasons, typical flow is to first stop the server, if stopping the server takes time (for instance the standbys or subscribers lag behind the primary by too much), kill the VM/host server to make the corresponding replication slots inactive on the primary and then drop the replication slots. The proposed force-drop function helps speed up these operations in production environments and it will also be possible to provide an SLA for these disable operations. I hope the user case is clear. Regards, Bharath Rupireddy.
Re: A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments
From
"Hsu, John"
Date:
Hi, Why couldn't you terminate the active_pid associated with the slot you want to drop if it's active prior to dropping? On 6/10/22 3:03 AM, Bharath Rupireddy wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you canconfirm the sender and know the content is safe. > > > > On Thu, Jun 9, 2022 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Thu, Jun 9, 2022 at 11:07 AM Bharath Rupireddy >> <bharath.rupireddyforpostgres@gmail.com> wrote: >>> Currently postgres doesn't allow dropping a replication slot that's active [1]. This can make certain operations moretime-consuming or stuck in production environments. These operations are - disable async/sync standbys and disable logicalreplication that require the postgres running on standby or the subscriber to go down. If stopping postgres servertakes time, the VM or container will have to be killed forcefully which can take a considerable amount of time as thereare many layers in between. >>> >> Why do you want to drop the slot when the server is going down? Is it >> some temporary replication slot, otherwise, how will you resume >> replication after restarting the server? > The setup is this - primary, bunch of sync standbys, bunch of read > replicas (async standbys), bunch of logical replication subscribers - > now, the user wants to remove any of them for whatever reasons, > typical flow is to first stop the server, if stopping the server takes > time (for instance the standbys or subscribers lag behind the primary > by too much), kill the VM/host server to make the corresponding > replication slots inactive on the primary and then drop the > replication slots. The proposed force-drop function helps speed up > these operations in production environments and it will also be > possible to provide an SLA for these disable operations. > > I hope the user case is clear. > > Regards, > Bharath Rupireddy. > >
Re: A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments
From
Bharath Rupireddy
Date:
On Fri, Jun 10, 2022 at 8:42 PM Hsu, John <hsuchen@amazon.com> wrote: > > Hi, > > Why couldn't you terminate the active_pid associated with the slot you > want to drop if it's active prior to dropping? In that case, the slot becomes active immediately after killing the old walsender because the standby/subscriber opens another connection with the primary using the same replication slot. The replication slot will be inactive for a moment during pg_terminate_backend and becomes active again by the time we call pg_drop_replication_slot and we hit the same ERROR: replication slot "foo" is active for PID XXXXX. The idea proposed here is to have a force-drop function that terminates the walsender gracefully and drops the replication slot even though there's somebody using it and all of this is done with an exclusive lock on the slot so that nobody can acquire it while we are dropping it. Regards, Bharath Rupireddy.