Thread: Avoiding data loss with synchronous replication
Hi hackers,

As previously discussed [0], canceling synchronous replication waits can have the unfortunate side effect of making transactions visible on a primary server before they are replicated. A failover at this time would cause such transactions to be lost. The proposed solution in the previous thread [0] involved blocking such cancellations, but many had concerns about that approach (e.g., backends could be unresponsive, and server restarts were still affected by the problem). I would like to propose something more like what Fujii-san suggested [1] that would avoid blocking cancellations while still preventing data loss. I believe this is a key missing piece of the synchronous replication functionality in PostgreSQL.

AFAICT there are a variety of ways that the aforementioned problem may occur:

1. Server restarts: As noted in the docs [2], "waiting transactions will be marked fully committed once the primary database recovers." I think there are a few options for handling this, but the simplest would be to simply fail over anytime the primary server shuts down. My proposal may offer other ways of helping with this.

2. Backend crashes: If a backend crashes, the postmaster process will restart everything, leading to the same problem described in 1. However, this behavior can be prevented with the restart_after_crash parameter [3].

3. Client disconnections: During waits for synchronous replication, interrupt processing is turned off, so disconnected clients actually don't seem to cause a problem. The server will still wait for synchronous replication to complete prior to making the transaction visible on the primary.

4. Query cancellations and backend terminations: This appears to be the only gap where there is no way to avoid potential data loss, and it is the main target of my proposal. (A rough sketch of this scenario follows after the references below.)

Instead of blocking query cancellations and backend terminations, I think we should allow them to proceed, but we should keep the transactions marked in-progress so they do not yet become visible to sessions on the primary. Once replication has caught up to the necessary point, the transactions can be marked completed, and they would finally become visible.

The main advantages of this approach are 1) it still allows for canceling waits for synchronous replication, and 2) it provides an opportunity to view and manage waits for synchronous replication outside of the standard cancellation/termination functionality. The tooling for 2 could even allow a session to begin waiting for synchronous replication again if it "inadvertently interrupted a replication wait..." [4]. I think the main disadvantage of this approach is that transactions committed by a session may not be immediately visible to that session when the command returns after canceling the wait for synchronous replication. Instead, the transactions would become visible in the future once the change is replicated. This may cause problems for an application if it doesn't handle this scenario carefully.

What are folks' opinions on this idea? Is this something that is worth prototyping?
Nathan

[0] https://www.postgresql.org/message-id/flat/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C@yandex-team.ru
[1] https://www.postgresql.org/message-id/4f8d54c9-6f18-23d5-c4de-9d6656d3a408%40oss.nttdata.com
[2] https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA
[3] https://www.postgresql.org/docs/devel/runtime-config-error-handling.html#GUC-RESTART-AFTER-CRASH
[4] https://www.postgresql.org/message-id/CA%2BTgmoZpwBEyPDZixeHN9ZeNJJjd3EBEQ8nJPaRAsVexhssfNg%40mail.gmail.com
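(For concreteness, here is roughly what the gap in item 4 above looks like today. This is only a sketch, assuming a primary whose sole listed synchronous standby is unreachable and a pre-existing table t1; the exact message wording may vary by version.)

    postgres=# BEGIN;
    BEGIN
    postgres=*# INSERT INTO t1 VALUES (1);
    INSERT 0 1
    postgres=*# COMMIT;        -- hangs waiting for the synchronous standby
    ^CCancel request sent
    WARNING:  canceling wait for synchronous replication due to user request
    DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
    COMMIT
    postgres=# SELECT count(*) FROM t1;
     count
    -------
         1
    (1 row)

The row is already visible on the primary even though no standby has acknowledged it; a failover at this point loses the transaction.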
On Fri, Jul 23, 2021 at 2:48 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Hi hackers,
>
> As previously discussed [0], canceling synchronous replication waits can have the unfortunate side effect of making transactions visible on a primary server before they are replicated. A failover at this time would cause such transactions to be lost. The proposed solution in the previous thread [0] involved blocking such cancellations, but many had concerns about that approach (e.g., backends could be unresponsive, server restarts were still affected by this problem). I would like to propose something more like what Fujii-san suggested [1] that would avoid blocking cancellations while still preventing data loss. I believe this is a key missing piece of the synchronous replication functionality in PostgreSQL.
>
> AFAICT there are a variety of ways that the aforementioned problem may occur:
>
> 1. Server restarts: As noted in the docs [2], "waiting transactions will be marked fully committed once the primary database recovers." I think there are a few options for handling this, but the simplest would be to simply fail over anytime the primary server shuts down. My proposal may offer other ways of helping with this.
>
> 2. Backend crashes: If a backend crashes, the postmaster process will restart everything, leading to the same problem described in 1. However, this behavior can be prevented with the restart_after_crash parameter [3].
>
> 3. Client disconnections: During waits for synchronous replication, interrupt processing is turned off, so disconnected clients actually don't seem to cause a problem. The server will still wait for synchronous replication to complete prior to making the transaction visible on the primary.
>
> 4. Query cancellations and backend terminations: This appears to be the only gap where there is no way to avoid potential data loss, and it is the main target of my proposal.
>
> Instead of blocking query cancellations and backend terminations, I think we should allow them to proceed, but we should keep the transactions marked in-progress so they do not yet become visible to sessions on the primary.

One naive question: what if the primary gets some error while changing the status from in-progress to committed? Won't the transaction in such a case be visible on the standby but not on the primary?

> Once replication has caught up to the necessary point, the transactions can be marked completed, and they would finally become visible.

If the session that issued the commit is terminated, will this work be done by some background process?

--
With Regards,
Amit Kapila.
On Thu, 2021-07-22 at 21:17 +0000, Bossart, Nathan wrote:
> As previously discussed [0], canceling synchronous replication waits can have the unfortunate side effect of making transactions visible on a primary server before they are replicated. A failover at this time would cause such transactions to be lost.
>
> AFAICT there are a variety of ways that the aforementioned problem may occur:
>
> 4. Query cancellations and backend terminations: This appears to be the only gap where there is no way to avoid potential data loss, and it is the main target of my proposal.
>
> Instead of blocking query cancellations and backend terminations, I think we should allow them to proceed, but we should keep the transactions marked in-progress so they do not yet become visible to sessions on the primary. Once replication has caught up to the necessary point, the transactions can be marked completed, and they would finally become visible.
>
> The main advantages of this approach are 1) it still allows for canceling waits for synchronous replication, and 2) it provides an opportunity to view and manage waits for synchronous replication outside of the standard cancellation/termination functionality. The tooling for 2 could even allow a session to begin waiting for synchronous replication again if it "inadvertently interrupted a replication wait..." [4]. I think the main disadvantage of this approach is that transactions committed by a session may not be immediately visible to that session when the command returns after canceling the wait for synchronous replication. Instead, the transactions would become visible in the future once the change is replicated. This may cause problems for an application if it doesn't handle this scenario carefully.
>
> What are folks' opinions on this idea? Is this something that is worth prototyping?

But that would mean that changes ostensibly rolled back (because the cancel request succeeded) will later turn out to be committed after all, just like it is now (only later). Where is the advantage?

Besides, there is no room for another transaction status in the commit log.

Yours,
Laurenz Albe
Hi Nathan!

Thanks for your interest in the topic. I think in the thread [0] we almost agreed on a general design. The only remaining question is whether we want to treat pg_ctl stop and kill SIGTERM differently from pg_terminate_backend().

> On 23 Jul 2021, at 02:17, Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Hi hackers,
>
> As previously discussed [0], canceling synchronous replication waits can have the unfortunate side effect of making transactions visible on a primary server before they are replicated. A failover at this time would cause such transactions to be lost. The proposed solution in the previous thread [0] involved blocking such cancellations, but many had concerns about that approach (e.g., backends could be unresponsive, server restarts were still affected by this problem). I would like to propose something more like what Fujii-san suggested [1] that would avoid blocking cancellations while still preventing data loss. I believe this is a key missing piece of the synchronous replication functionality in PostgreSQL.
>
> AFAICT there are a variety of ways that the aforementioned problem may occur:
>
> 1. Server restarts: As noted in the docs [2], "waiting transactions will be marked fully committed once the primary database recovers." I think there are a few options for handling this, but the simplest would be to simply fail over anytime the primary server shuts down. My proposal may offer other ways of helping with this.

I think a simple check that no other primary exists would suffice. Currently this is entirely the concern of the HA tool.

> 2. Backend crashes: If a backend crashes, the postmaster process will restart everything, leading to the same problem described in 1. However, this behavior can be prevented with the restart_after_crash parameter [3].
>
> 3. Client disconnections: During waits for synchronous replication, interrupt processing is turned off, so disconnected clients actually don't seem to cause a problem. The server will still wait for synchronous replication to complete prior to making the transaction visible on the primary.

+1.

> 4. Query cancellations and backend terminations: This appears to be the only gap where there is no way to avoid potential data loss, and it is the main target of my proposal.
>
> Instead of blocking query cancellations and backend terminations, I think we should allow them to proceed, but we should keep the transactions marked in-progress so they do not yet become visible to sessions on the primary. Once replication has caught up to the necessary point, the transactions can be marked completed, and they would finally become visible.
>
> The main advantages of this approach are 1) it still allows for canceling waits for synchronous replication

You can cancel synchronous replication with:

    ALTER SYSTEM SET synchronous_standby_names TO 'new quorum';
    SELECT pg_reload_conf();

All backends waiting for sync rep will proceed with the new quorum.

> and 2) it provides an opportunity to view and manage waits for synchronous replication outside of the standard cancellation/termination functionality. The tooling for 2 could even allow a session to begin waiting for synchronous replication again if it "inadvertently interrupted a replication wait..." [4]. I think the main disadvantage of this approach is that transactions committed by a session may not be immediately visible to that session when the command returns after canceling the wait for synchronous replication. Instead, the transactions would become visible in the future once the change is replicated. This may cause problems for an application if it doesn't handle this scenario carefully.
>
> What are folks' opinions on this idea? Is this something that is worth prototyping?

In fact you propose converting the transaction to 2PC if we get a CANCEL during the sync rep wait: transferring locks and other stuff somewhere, acquiring a new VXID for our backend, sending CommandComplete while it's not in fact complete, etc. I think this is overly complex for the reasons provided.

The ultimate purpose of synchronous replication is to make a client wait when it's necessary to wait. If the client wishes to execute more commands, they can open a new connection or set synchronous_commit to the desired level in the first place. Canceling a locally committed transaction will not be possible anyway.

Thanks!

Best regards,
Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.camel%40j-davis.com#415dc2f7d41b8a251b419256407bb64d
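(A minimal sketch of the per-transaction alternative Andrey mentions above: opting out of the synchronous wait up front instead of canceling it afterwards. It reuses the t1 table from earlier in the thread.)

    BEGIN;
    SET LOCAL synchronous_commit = local;  -- this transaction does not wait for any standby
    INSERT INTO t1 VALUES (2);
    COMMIT;                                -- returns once the commit record is flushed locally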
On 7/23/21, 3:58 AM, "Amit Kapila" <amit.kapila16@gmail.com> wrote:
> On Fri, Jul 23, 2021 at 2:48 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>> Instead of blocking query cancellations and backend terminations, I think we should allow them to proceed, but we should keep the transactions marked in-progress so they do not yet become visible to sessions on the primary.
>
> One naive question: what if the primary gets some error while changing the status from in-progress to committed? Won't the transaction in such a case be visible on the standby but not on the primary?

Yes. In this case, the transaction would remain in-progress on the primary until it can be marked committed.

>> Once replication has caught up to the necessary point, the transactions can be marked completed, and they would finally become visible.
>
> If the session that issued the commit is terminated, will this work be done by some background process?

I think the way I'm imagining it is that a background process would be responsible for handling all of the "offloaded" transactions. I'm not wedded to any particular design at this point, though.

Nathan
On 7/23/21, 4:23 AM, "Laurenz Albe" <laurenz.albe@cybertec.at> wrote:
> But that would mean that changes ostensibly rolled back (because the cancel request succeeded) will later turn out to be committed after all, just like it is now (only later). Where is the advantage?

The advantage is that I can cancel waits for synchronous replication without risking data loss. The transactions would still be marked in-progress until we get the proper acknowledgement from the standbys.

> Besides, there is no room for another transaction status in the commit log.

Right. Like the existing synchronous replication functionality, the commit log would be updated, but the transactions would still appear to be in-progress. Today, this is done via the procarray.

Nathan
On 7/23/21, 4:33 AM, "Andrey Borodin" <x4mmm@yandex-team.ru> wrote:
> Thanks for your interest in the topic. I think in the thread [0] we almost agreed on a general design. The only remaining question is whether we want to treat pg_ctl stop and kill SIGTERM differently from pg_terminate_backend().

I didn't get the idea that there was a tremendous amount of support for the approach of blocking cancellation of waits for synchronous replication. FWIW this was my initial approach as well, but I've been trying to think of alternatives.

If we can gather support for some variation of the block-cancels approach, I think that would be preferred over my proposal from a complexity standpoint. Robert's idea to provide a way to understand the intent of the cancellation/termination request [0] could improve matters. Perhaps adding an argument to pg_cancel/terminate_backend() and using different signals to indicate that we want to cancel the wait would be something that folks could get on board with.

Nathan

[0] https://www.postgresql.org/message-id/CA%2BTgmoaW8syC_wqQcsJ%3DsQ0gTbFVC6MqYmxbwNHk5w%3DxJ-McOQ%40mail.gmail.com
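(Purely illustrative: neither function accepts such an argument today, and the flag name and the PID below are invented. A hypothetical shape of the interface Nathan describes, alongside the existing way to spot backends stuck in a sync rep wait:)

    -- Backends currently waiting on a synchronous standby show up in
    -- pg_stat_activity with wait_event 'SyncRep'.
    SELECT pid, usename, state, query
    FROM pg_stat_activity
    WHERE wait_event = 'SyncRep';

    -- Hypothetical only: an explicit flag would let the caller state that
    -- abandoning the synchronous replication wait is intentional, instead of
    -- overloading the ordinary cancel/terminate signals.
    SELECT pg_cancel_backend(12345, cancel_sync_rep_wait => true);
    SELECT pg_terminate_backend(12345, cancel_sync_rep_wait => true);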
> On 23 Jul 2021, at 22:54, Bossart, Nathan <bossartn@amazon.com> wrote:
>
> On 7/23/21, 4:33 AM, "Andrey Borodin" <x4mmm@yandex-team.ru> wrote:
>> Thanks for your interest in the topic. I think in the thread [0] we almost agreed on a general design. The only remaining question is whether we want to treat pg_ctl stop and kill SIGTERM differently from pg_terminate_backend().
>
> I didn't get the idea that there was a tremendous amount of support for the approach of blocking cancellation of waits for synchronous replication. FWIW this was my initial approach as well, but I've been trying to think of alternatives.
>
> If we can gather support for some variation of the block-cancels approach, I think that would be preferred over my proposal from a complexity standpoint.

Let's clearly enumerate the problems of blocking.

It's been mentioned that the backend is not responsive when cancelation is blocked. But on the contrary, it's very responsive.

    postgres=# alter system set synchronous_standby_names to 'bogus';
    ALTER SYSTEM
    postgres=# alter system set synchronous_commit_cancelation TO off;
    ALTER SYSTEM
    postgres=# select pg_reload_conf();
    2021-07-24 15:35:03.054 +05 [10452] LOG:  received SIGHUP, reloading configuration files
     pg_reload_conf
    ----------------
     t
    (1 row)

    postgres=# begin;
    BEGIN
    postgres=*# insert into t1 values(0);
    INSERT 0 1
    postgres=*# commit;
    ^CCancel request sent
    WARNING:  canceling wait for synchronous replication requested, but cancelation is not allowed
    DETAIL:  The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.
    ^CCancel request sent
    WARNING:  canceling wait for synchronous replication requested, but cancelation is not allowed
    DETAIL:  The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.

It tells clearly what's wrong. If that's still not enough, let's add a hint about synchronous standby names.

Are there any other problems with blocking cancels?

> Robert's idea to provide a way to understand the intent of the cancellation/termination request [0] could improve matters. Perhaps adding an argument to pg_cancel/terminate_backend() and using different signals to indicate that we want to cancel the wait would be something that folks could get on board with.

The semantics of cancelation assume correct query interruption. This is not possible once we have committed locally, so there cannot be any correct cancelation, and I don't think it is worth adding an incorrect one.

Interestingly, converting the transaction to 2PC is a neat idea when the backend is terminated. It provides more guarantees that the transaction will commit correctly even after a restart. But we may be short of max_prepared_xacts slots...

Anyway, backend termination bothers me a lot less than cancelation; drivers do not terminate queries on their own, but they do cancel queries by default.

Thanks!

Best regards,
Andrey Borodin.
Hi,

On 2021-07-22 21:17:56 +0000, Bossart, Nathan wrote:
> AFAICT there are a variety of ways that the aforementioned problem may occur:
>
> 1. Server restarts: As noted in the docs [2], "waiting transactions will be marked fully committed once the primary database recovers." I think there are a few options for handling this, but the simplest would be to simply fail over anytime the primary server shuts down. My proposal may offer other ways of helping with this.
>
> 2. Backend crashes: If a backend crashes, the postmaster process will restart everything, leading to the same problem described in 1. However, this behavior can be prevented with the restart_after_crash parameter [3].
>
> 3. Client disconnections: During waits for synchronous replication, interrupt processing is turned off, so disconnected clients actually don't seem to cause a problem. The server will still wait for synchronous replication to complete prior to making the transaction visible on the primary.
>
> 4. Query cancellations and backend terminations: This appears to be the only gap where there is no way to avoid potential data loss, and it is the main target of my proposal.
>
> Instead of blocking query cancellations and backend terminations, I think we should allow them to proceed, but we should keep the transactions marked in-progress so they do not yet become visible to sessions on the primary. Once replication has caught up to the necessary point, the transactions can be marked completed, and they would finally become visible.

I think there are two aspects making this proposal problematic:

First, from the user experience side of things, the issue is that this seems to propose violating read-your-own-writes, within a single connection to a single node. Which imo is *far* worse than seeing writes that haven't yet been acknowledged as replicated after a query cancel.

Second, on the implementation side, I think this proposal practically amounts to internally converting plain transaction commits into 2PC prepare/commit. With all the associated overhead (two WAL entries/flushes per commit, needing a separate set of procarray entries to hold the resources for the prepared-but-not-committed transactions, potential for running out of the extra procarray slots). What if a user rapidly commits and cancels in a loop? You'll almost immediately run out of procarray slots to represent all those "not really committed" transactions.

I think there's benefit in optionally turning all transactions into 2PC ones, but I don't see it being fast enough to be the only option.

Greetings,

Andres Freund
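(For reference, a minimal sketch of the explicit two-phase flow Andres is comparing the proposal to; it requires max_prepared_transactions > 0, and the transaction identifier here is arbitrary:)

    BEGIN;
    INSERT INTO t1 VALUES (3);
    PREPARE TRANSACTION 'syncrep_pending_1';  -- first WAL record/flush; the prepared
                                              -- transaction keeps holding its locks and
                                              -- occupies a procarray slot
    -- ... later, e.g. once the standby has acknowledged the commit record ...
    COMMIT PREPARED 'syncrep_pending_1';      -- second WAL record/flush; only now does
                                              -- the transaction become visible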
Hi,

On 2021-07-24 15:53:15 +0500, Andrey Borodin wrote:
> Are there any other problems with blocking cancels?

Unless you have commandline access to the server, it's not hard to get into a situation where you can't change the configuration setting because all connections are hanging, and you can't even log in to do an ALTER SYSTEM etc. You can't kill applications to kill the connection, because they will just continue to hang.

Greetings,

Andres Freund
> On 25 Jul 2021, at 05:29, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-07-24 15:53:15 +0500, Andrey Borodin wrote:
>> Are there any other problems with blocking cancels?
>
> Unless you have commandline access to the server, it's not hard to get into a situation where you can't change the configuration setting because all connections are hanging, and you can't even log in to do an ALTER SYSTEM etc. You can't kill applications to kill the connection, because they will just continue to hang.
Hmm, yes, it's not hard to get into this situation intentionally. But what setup would get you into such trouble? Setting up sync rep, but not configuring an HA tool?

In normal circumstances an HA cluster is not configured to allow this. Normally, hanging commits are part of a failover: somewhere a new primary server is operating, and you have to find commandline access to the server anyway to execute pg_rewind and join this node to the cluster again as a standby.
Anyway, it's a good idea to set up superuser_reserved_connections for administrative intervention [0].
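(For example; superuser_reserved_connections is a real parameter whose default is 3, and it only takes effect at server start:)

    -- Keep some connection slots reserved so a superuser can still log in and
    -- change settings even when all regular connections are stuck waiting for
    -- synchronous replication.  (Default is 3; takes effect at the next restart.)
    ALTER SYSTEM SET superuser_reserved_connections = 5;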
I like the idea of transferring transaction locks somewhere until the synchronous_commit requirements are satisfied. It brings us closer to making these locks durable enough to survive a restart. But, IMO, the complexity and potentially dangerous conditions easily outweigh the benefits of this approach.
Thanks!
Best regards, Andrey Borodin.
[0]
On 7/24/21, 3:54 AM, "Andrey Borodin" <x4mmm@yandex-team.ru> wrote:
> Let's clearly enumerate the problems of blocking.
>
> It's been mentioned that the backend is not responsive when cancelation is blocked. But on the contrary, it's very responsive.

It is responsive in the sense that it emits a WARNING to the client whose backend received the request. However, it is still not responsive in other aspects. The backend won't take the requested action, and if the action was requested via pg_cancel/terminate_backend(), no useful feedback is provided to the user to explain why it is blocked.

> The semantics of cancelation assume correct query interruption. This is not possible once we have committed locally, so there cannot be any correct cancelation, and I don't think it is worth adding an incorrect one.

The latest version of the block-cancels patch that I've seen still allows you to cancel things if you really want to. For example, you can completely turn off synchronous replication by unsetting synchronous_standby_names. Alternatively, you could store the value of the new block-cancels parameter in shared memory and simply turn that off to allow cancellations to proceed. In either case, a user is forced to change the settings for the whole server. I think allowing users to target a specific synchronous replication wait is useful. Even if I want to block canceling waits for most queries, perhaps I am okay with unblocking an administrative session that is stuck trying to update a password (although that could also be achieved by remembering to set synchronous_commit = local).

What do you think about allowing multiple sets of behavior with the new parameter? The "always allow" value would make things work just like they do today. The "when specifically requested" value would allow users to use a new mechanism (perhaps new signals) to intentionally cancel synchronous replication waits. And the "always block" value would disallow canceling such waits without altering the server-wide settings.

Nathan
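(A purely hypothetical sketch of what such a three-valued parameter might look like; the parameter name and its values are invented here for illustration and do not exist:)

    -- Hypothetical parameter, invented for illustration; nothing like it exists today.
    --   'always'  : current behavior (any cancel/terminate releases the wait)
    --   'request' : only a cancel/terminate that explicitly targets the wait releases it
    --   'never'   : the wait cannot be canceled without changing server-wide settings
    ALTER SYSTEM SET synchronous_wait_cancellation = 'request';
    SELECT pg_reload_conf();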
On 7/24/21, 5:25 PM, "Andres Freund" <andres@anarazel.de> wrote:
> First, from the user experience side of things, the issue is that this seems to propose violating read-your-own-writes, within a single connection to a single node. Which imo is *far* worse than seeing writes that haven't yet been acknowledged as replicated after a query cancel.

Right. I suspect others will have a similar opinion.

Nathan