Avoiding data loss with synchronous replication

From
"Bossart, Nathan"
Date:
Hi hackers,

As previously discussed [0], canceling synchronous replication waits
can have the unfortunate side effect of making transactions visible on
a primary server before they are replicated.  A failover at this time
would cause such transactions to be lost.  The proposed solution in
the previous thread [0] involved blocking such cancellations, but many
had concerns about that approach (e.g., backends could be
unresponsive, server restarts were still affected by this problem).  I
would like to propose something more like what Fujii-san suggested [1]
that would avoid blocking cancellations while still preventing data
loss.  I believe this is a key missing piece of the synchronous
replication functionality in PostgreSQL.

AFAICT there are a variety of ways that the aforementioned problem may
occur:
  1. Server restarts: As noted in the docs [2], "waiting transactions
     will be marked fully committed once the primary database
     recovers."  I think there are a few options for handling this,
     but the simplest would be to simply fail over anytime the primary
     server shuts down.  My proposal may offer other ways of helping
     with this.
  2. Backend crashes: If a backend crashes, the postmaster process
     will restart everything, leading to the same problem described in
     1.  However, this behavior can be prevented with the
     restart_after_crash parameter [3].
  3. Client disconnections: During waits for synchronous replication,
     interrupt processing is turned off, so disconnected clients
     actually don't seem to cause a problem.  The server will still
     wait for synchronous replication to complete prior to making the
     transaction visible on the primary.
  4. Query cancellations and backend terminations: This appears to be
     the only gap where there is no way to avoid potential data loss,
     and it is the main target of my proposal.

Instead of blocking query cancellations and backend terminations, I
think we should allow them to proceed, but we should keep the
transactions marked in-progress so they do not yet become visible to
sessions on the primary.  Once replication has caught up to the
necessary point, the transactions can be marked completed, and
they would finally become visible.
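
For reference, here is roughly what the problematic case looks like
today (a sketch against a primary whose synchronous standby never
acknowledges; table t1 is assumed to exist):

postgres=# INSERT INTO t1 VALUES (1);
^CCancel request sent
WARNING:  canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
INSERT 0 1

After the cancel, the row is immediately visible to every other session
on the primary even though no standby has acknowledged it.  Under this
proposal, the command would still return, but the transaction would
remain in-progress (and invisible) until replication caught up.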

The main advantages of this approach are 1) it still allows for
canceling waits for synchronous replication and 2) it provides an
opportunity to view and manage waits for synchronous replication
outside of the standard cancellation/termination functionality.  The
tooling for 2 could even allow a session to begin waiting for
synchronous replication again if it "inadvertently interrupted a
replication wait..." [4].  I think the main disadvantage of this
approach is that transactions committed by a session may not be
immediately visible to the session when the command returns after
canceling the wait for synchronous replication.  Instead, the
transactions would become visible in the future once the change is
replicated.  This may cause problems for an application if it doesn't
handle this scenario carefully.
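
One way an application could cope with that delayed visibility is to
track the transaction's status itself.  Here is a rough sketch using
the existing pg_current_xact_id() and pg_xact_status() functions (the
xid value below is purely illustrative):

BEGIN;
INSERT INTO t1 VALUES (1);
SELECT pg_current_xact_id();  -- client records this xid8, e.g. 1234
COMMIT;                       -- may return after a canceled wait

-- later, possibly from another connection:
SELECT pg_xact_status('1234'::xid8);

Under this proposal, pg_xact_status() would presumably report 'in
progress' until the necessary standbys acknowledge the commit, and
'committed' afterwards.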

What are folks' opinions on this idea?  Is this something that is
worth prototyping?

Nathan

[0] https://www.postgresql.org/message-id/flat/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C@yandex-team.ru
[1] https://www.postgresql.org/message-id/4f8d54c9-6f18-23d5-c4de-9d6656d3a408%40oss.nttdata.com
[2] https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA
[3] https://www.postgresql.org/docs/devel/runtime-config-error-handling.html#GUC-RESTART-AFTER-CRASH
[4] https://www.postgresql.org/message-id/CA%2BTgmoZpwBEyPDZixeHN9ZeNJJjd3EBEQ8nJPaRAsVexhssfNg%40mail.gmail.com


Re: Avoiding data loss with synchronous replication

From
Amit Kapila
Date:
On Fri, Jul 23, 2021 at 2:48 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Hi hackers,
>
> As previously discussed [0], canceling synchronous replication waits
> can have the unfortunate side effect of making transactions visible on
> a primary server before they are replicated.  A failover at this time
> would cause such transactions to be lost.  The proposed solution in
> the previous thread [0] involved blocking such cancellations, but many
> had concerns about that approach (e.g., backends could be
> unresponsive, server restarts were still affected by this problem).  I
> would like to propose something more like what Fujii-san suggested [1]
> that would avoid blocking cancellations while still preventing data
> loss.  I believe this is a key missing piece of the synchronous
> replication functionality in PostgreSQL.
>
> AFAICT there are a variety of ways that the aforementioned problem may
> occur:
>   1. Server restarts: As noted in the docs [2], "waiting transactions
>      will be marked fully committed once the primary database
>      recovers."  I think there are a few options for handling this,
>      but the simplest would be to simply fail over anytime the primary
>      server shuts down.  My proposal may offer other ways of helping
>      with this.
>   2. Backend crashes: If a backend crashes, the postmaster process
>      will restart everything, leading to the same problem described in
>      1.  However, this behavior can be prevented with the
>      restart_after_crash parameter [3].
>   3. Client disconnections: During waits for synchronous replication,
>      interrupt processing is turned off, so disconnected clients
>      actually don't seem to cause a problem.  The server will still
>      wait for synchronous replication to complete prior to making the
>      transaction visible on the primary.
>   4. Query cancellations and backend terminations: This appears to be
>      the only gap where there is no way to avoid potential data loss,
>      and it is the main target of my proposal.
>
> Instead of blocking query cancellations and backend terminations, I
> think we should allow them to proceed, but we should keep the
> transactions marked in-progress so they do not yet become visible to
> sessions on the primary.
>

One naive question: what if the primary gets some error while changing
the status from in-progress to committed?  Won't the transaction, in
such a case, be visible on the standby but not on the primary?

>  Once replication has caught up to
> the necessary point, the transactions can be marked completed, and
> they would finally become visible.
>

If the session that issued the commit is terminated, will this work be
done by some background process?

-- 
With Regards,
Amit Kapila.



Re: Avoiding data loss with synchronous replication

From
Laurenz Albe
Date:
On Thu, 2021-07-22 at 21:17 +0000, Bossart, Nathan wrote:
> As previously discussed [0], canceling synchronous replication waits
> can have the unfortunate side effect of making transactions visible on
> a primary server before they are replicated.  A failover at this time
> would cause such transactions to be lost.
> 
> AFAICT there are a variety of ways that the aforementioned problem may
> occur:
>   4. Query cancellations and backend terminations: This appears to be
>      the only gap where there is no way to avoid potential data loss,
>      and it is the main target of my proposal.
> 
> Instead of blocking query cancellations and backend terminations, I
> think we should allow them to proceed, but we should keep the
> transactions marked in-progress so they do not yet become visible to
> sessions on the primary.  Once replication has caught up to
> the necessary point, the transactions can be marked completed, and
> they would finally become visible.
> 
> The main advantages of this approach are 1) it still allows for
> canceling waits for synchronous replication and 2) it provides an
> opportunity to view and manage waits for synchronous replication
> outside of the standard cancellation/termination functionality.  The
> tooling for 2 could even allow a session to begin waiting for
> synchronous replication again if it "inadvertently interrupted a
> replication wait..." [4].  I think the main disadvantage of this
> approach is that transactions committed by a session may not be
> immediately visible to the session when the command returns after
> canceling the wait for synchronous replication.  Instead, the
> transactions would become visible in the future once the change is
> replicated.  This may cause problems for an application if it doesn't
> handle this scenario carefully.
> 
> What are folks' opinions on this idea?  Is this something that is
> worth prototyping?

But that would mean that changes ostensibly rolled back (because the
cancel request succeeded) will later turn out to be committed after all,
just like it is now (only later).  Where is the advantage?

Besides, there is no room for another transaction status in the
commit log.

Yours,
Laurenz Albe




Re: Avoiding data loss with synchronous replication

From
Andrey Borodin
Date:
Hi Nathan!

Thanks for your interest in the topic. I think in the thread [0] we almost agreed on the general design.
The only remaining question is whether we want to treat pg_ctl stop and kill -SIGTERM differently from pg_terminate_backend().

> On 23 Jul 2021, at 02:17, Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Hi hackers,
>
> As previously discussed [0], canceling synchronous replication waits
> can have the unfortunate side effect of making transactions visible on
> a primary server before they are replicated.  A failover at this time
> would cause such transactions to be lost.  The proposed solution in
> the previous thread [0] involved blocking such cancellations, but many
> had concerns about that approach (e.g., backends could be
> unresponsive, server restarts were still affected by this problem).  I
> would like to propose something more like what Fujii-san suggested [1]
> that would avoid blocking cancellations while still preventing data
> loss.  I believe this is a key missing piece of the synchronous
> replication functionality in PostgreSQL.
>
> AFAICT there are a variety of ways that the aforementioned problem may
> occur:
>  1. Server restarts: As noted in the docs [2], "waiting transactions
>     will be marked fully committed once the primary database
>     recovers."  I think there are a few options for handling this,
>     but the simplest would be to simply fail over anytime the primary
>     server shuts down.  My proposal may offer other ways of helping
>     with this.
I think a simple check that no other primary exists would suffice.
Currently this is entirely the concern of the HA tool.

>  2. Backend crashes: If a backend crashes, the postmaster process
>     will restart everything, leading to the same problem described in
>     1.  However, this behavior can be prevented with the
>     restart_after_crash parameter [3].
>  3. Client disconnections: During waits for synchronous replication,
>     interrupt processing is turned off, so disconnected clients
>     actually don't seem to cause a problem.  The server will still
>     wait for synchronous replication to complete prior to making the
>     transaction visible on the primary.
+1.

>  4. Query cancellations and backend terminations: This appears to be
>     the only gap where there is no way to avoid potential data loss,
>     and it is the main target of my proposal.
>
> Instead of blocking query cancellations and backend terminations, I
> think we should allow them to proceed, but we should keep the
> transactions marked in-progress so they do not yet become visible to
> sessions on the primary.  Once replication has caught up to
> the necessary point, the transactions can be marked completed, and
> they would finally become visible.
>
> The main advantages of this approach are 1) it still allows for
> canceling waits for synchronous replication
You can cancel synchronous replication by
ALTER SYSTEM SET synchronous_standby_names TO 'new quorum';
SELECT pg_reload_conf();

All backends waiting for sync rep will proceed with new quorum.

> and 2) it provides an
> opportunity to view and manage waits for synchronous replication
> outside of the standard cancellation/termination functionality.  The
> tooling for 2 could even allow a session to begin waiting for
> synchronous replication again if it "inadvertently interrupted a
> replication wait..." [4].  I think the main disadvantage of this
> approach is that transactions committed by a session may not be
> immediately visible to the session when the command returns after
> canceling the wait for synchronous replication.  Instead, the
> transactions would become visible in the future once the change is
> replicated.  This may cause problems for an application if it doesn't
> handle this scenario carefully.
>
> What are folks' opinions on this idea?  Is this something that is
> worth prototyping?

In fact, you propose converting the transaction to 2PC if we get a CANCEL during the sync rep wait:
transferring locks and other stuff somewhere, acquiring a new VXid for our backend, sending CommandComplete while it's
not in fact complete, etc.
I think that's overly complex for the reasons provided.

The ultimate purpose of synchronous replication is to make a client wait when it's necessary to wait. If the client
wishes to execute more commands, they can open a new connection or set synchronous_commit to the desired level in the
first place.  Canceling a locally committed transaction will not be possible anyway.
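
For completeness, the per-transaction escape hatch looks like this
(standard syntax, no patch required):

BEGIN;
SET LOCAL synchronous_commit = local;  -- this transaction will not wait
INSERT INTO t1 VALUES (1);
COMMIT;  -- returns once the commit record is flushed locally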

Thanks!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.camel%40j-davis.com#415dc2f7d41b8a251b419256407bb64d


Re: Avoiding data loss with synchronous replication

From
"Bossart, Nathan"
Date:
On 7/23/21, 3:58 AM, "Amit Kapila" <amit.kapila16@gmail.com> wrote:
> On Fri, Jul 23, 2021 at 2:48 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>> Instead of blocking query cancellations and backend terminations, I
>> think we should allow them to proceed, but we should keep the
>> transactions marked in-progress so they do not yet become visible to
>> sessions on the primary.
>>
>
> One naive question, what if the primary gets some error while changing
> the status from in-progress to committed? Won't in such a case the
> transaction will be visible on standby but not on the primary?

Yes.  In this case, the transaction would remain in-progress on the
primary until it can be marked committed.

>>  Once replication has caught up to
>> the necessary point, the transactions can be marked completed, and
>> they would finally become visible.
>>
>
> If the session issued the commit is terminated, will this work be done
> by some background process?

I think the way I'm imagining it is that a background process would be
responsible for handling all of the "offloaded" transactions.  I'm not
wedded to any particular design at this point, though.

Nathan


Re: Avoiding data loss with synchronous replication

From
"Bossart, Nathan"
Date:
On 7/23/21, 4:23 AM, "Laurenz Albe" <laurenz.albe@cybertec.at> wrote:
> But that would mean that changes ostensibly rolled back (because the
> cancel request succeeded) will later turn out to be committed after all,
> just like it is now (only later).  Where is the advantage?

The advantage is that I can cancel waits for synchronous replication
without risking data loss.  The transactions would still be marked in-
progress until we get the proper acknowledgement from the standbys.

> Besides, there is no room for another transaction status in the
> commit log.

Right.  Like the existing synchronous replication functionality, the
commit log would be updated, but the transactions would still appear
to be in-progress.  Today, this is done via the procarray.

Nathan


Re: Avoiding data loss with synchronous replication

From
"Bossart, Nathan"
Date:
On 7/23/21, 4:33 AM, "Andrey Borodin" <x4mmm@yandex-team.ru> wrote:
> Thanks for your interest in the topic. I think in the thread [0] we almost agreed on the general design.
> The only remaining question is whether we want to treat pg_ctl stop and kill -SIGTERM differently from pg_terminate_backend().

I didn't get the idea that there was a tremendous amount of support
for the approach of blocking cancellation of synchronous replication waits.
FWIW this was my initial approach as well, but I've been trying to
think of alternatives.

If we can gather support for some variation of the block-cancels
approach, I think that would be preferred over my proposal from a
complexity standpoint.  Robert's idea to provide a way to understand
the intent of the cancellation/termination request [0] could improve
matters.  Perhaps adding an argument to pg_cancel/terminate_backend()
and using different signals to indicate that we want to cancel the
wait would be something that folks could get on board with.
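
To make that concrete, the interface might look something like the
following (the argument name is invented purely for illustration; no
such parameter exists today):

-- hypothetical: cancel only the wait for synchronous replication,
-- releasing the backend while the transaction stays invisible until
-- the standbys catch up
SELECT pg_cancel_backend(12345, cancel_syncrep_wait => true);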

Nathan

[0] https://www.postgresql.org/message-id/CA%2BTgmoaW8syC_wqQcsJ%3DsQ0gTbFVC6MqYmxbwNHk5w%3DxJ-McOQ%40mail.gmail.com


Re: Avoiding data loss with synchronous replication

From
Andrey Borodin
Date:

> On 23 Jul 2021, at 22:54, Bossart, Nathan <bossartn@amazon.com> wrote:
>
> On 7/23/21, 4:33 AM, "Andrey Borodin" <x4mmm@yandex-team.ru> wrote:
>> Thanks for your interest in the topic. I think in the thread [0] we almost agreed on the general design.
>> The only remaining question is whether we want to treat pg_ctl stop and kill -SIGTERM differently from pg_terminate_backend().
>
> I didn't get the idea that there was a tremendous amount of support
> for the approach of blocking cancellation of synchronous replication waits.
> FWIW this was my initial approach as well, but I've been trying to
> think of alternatives.
>
> If we can gather support for some variation of the block-cancels
> approach, I think that would be preferred over my proposal from a
> complexity standpoint.
Let's clearly enumerate the problems with blocking.
It's been mentioned that the backend is not responsive when cancelation is blocked. But on the contrary, it's very
responsive.

postgres=# alter system set synchronous_standby_names to 'bogus';
ALTER SYSTEM
postgres=# alter system set synchronous_commit_cancelation TO off ;
ALTER SYSTEM
postgres=# select pg_reload_conf();
2021-07-24 15:35:03.054 +05 [10452] LOG:  received SIGHUP, reloading configuration files
 pg_reload_conf
----------------
 t
(1 row)
postgres=# begin;
BEGIN
postgres=*# insert into t1 values(0);
INSERT 0 1
postgres=*# commit ;
^CCancel request sent
WARNING:  canceling wait for synchronous replication requested, but cancelation is not allowed
DETAIL:  The COMMIT record has already been flushed to WAL locally and might not have been replicated to the standby.
We must wait here.
^CCancel request sent
WARNING:  canceling wait for synchronous replication requested, but cancelation is not allowed
DETAIL:  The COMMIT record has already been flushed to WAL locally and might not have been replicated to the standby.
We must wait here.

It clearly tells you what's wrong. If that's still not enough, let's add a hint about the synchronous standby names.

Are there any other problems with blocking cancels?


> Robert's idea to provide a way to understand
> the intent of the cancellation/termination request [0] could improve
> matters.  Perhaps adding an argument to pg_cancel/terminate_backend()
> and using different signals to indicate that we want to cancel the
> wait would be something that folks could get on board with.

The semantics of cancelation assume correct query interruption. That is already impossible once we have committed
locally; there cannot be any correct cancelation. And I don't think it's worth adding an incorrect cancelation.


Interestingly, converting the transaction to 2PC is a neat idea when the backend is terminated. It provides a stronger
guarantee that the transaction will commit correctly even after a restart. But we may be short of max_prepared_xacts
slots...
Anyway, backend termination bothers me a lot less than cancelation - drivers do not terminate queries on their own. But
they cancel queries by default.


Thanks!

Best regards, Andrey Borodin.


Re: Avoiding data loss with synchronous replication

From
Andres Freund
Date:
Hi,

On 2021-07-22 21:17:56 +0000, Bossart, Nathan wrote:
> AFAICT there are a variety of ways that the aforementioned problem may
> occur:
>   1. Server restarts: As noted in the docs [2], "waiting transactions
>      will be marked fully committed once the primary database
>      recovers."  I think there are a few options for handling this,
>      but the simplest would be to simply fail over anytime the primary
>      server shuts down.  My proposal may offer other ways of helping
>      with this.
>   2. Backend crashes: If a backend crashes, the postmaster process
>      will restart everything, leading to the same problem described in
>      1.  However, this behavior can be prevented with the
>      restart_after_crash parameter [3].
>   3. Client disconnections: During waits for synchronous replication,
>      interrupt processing is turned off, so disconnected clients
>      actually don't seem to cause a problem.  The server will still
>      wait for synchronous replication to complete prior to making the
>      transaction visible on the primary.
>   4. Query cancellations and backend terminations: This appears to be
>      the only gap where there is no way to avoid potential data loss,
>      and it is the main target of my proposal.
> 
> Instead of blocking query cancellations and backend terminations, I
> think we should allow them to proceed, but we should keep the
> transactions marked in-progress so they do not yet become visible to
> sessions on the primary.  Once replication has caught up to
> the necessary point, the transactions can be marked completed, and
> they would finally become visible.

I think there's two aspects making this proposal problematic:

First, from the user experience side of things, the issue is that this seems
to propose violating read-your-own-writes. Within a single connection to a
single node. Which imo is *far* worse than seeing writes that haven't yet been
acknowledged as replicated after a query cancel.

Second, on the implementation side, I think this proposal practically amounts
to internally converting plain transaction commits into 2PC
prepare/commit. With all the associated overhead (two WAL entries/flushes per
commit, needing a separate set of procarray entries to hold the resources for
the prepared-but-not-committed transactions, potential for running out of
the extra procarray slots). What if a user rapidly commits-cancels in a loop?
You'll almost immediately run out of procarray slots to represent all those
"not really committed" transactions.

I think there's benefit in optionally turning all transactions into 2PC ones,
but I don't see it being fast enough to be the only option.
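
For comparison, the explicit form of what such a conversion amounts to
(standard two-phase syntax; requires max_prepared_transactions > 0):

BEGIN;
INSERT INTO t1 VALUES (1);
PREPARE TRANSACTION 'syncrep_wait';  -- first WAL flush; survives restart
-- ... wait for the standby to acknowledge the prepare ...
COMMIT PREPARED 'syncrep_wait';      -- second WAL flush; now visible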

Greetings,

Andres Freund



Re: Avoiding data loss with synchronous replication

From
Andres Freund
Date:
Hi,

On 2021-07-24 15:53:15 +0500, Andrey Borodin wrote:
> Are there any other problems with blocking cancels?

Unless you have commandline access to the server, it's not hard to get
into a situation where you can't change the configuration setting
because all connections are hanging, and you can't even log in to do an
ALTER SYSTEM etc. You can't kill applications to kill the connection,
because they will just continue to hang.

Greetings,

Andres Freund



Re: Avoiding data loss with synchronous replication

From
Andrey Borodin
Date:


> On 25 Jul 2021, at 05:29, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-07-24 15:53:15 +0500, Andrey Borodin wrote:
>> Are there any other problems with blocking cancels?
>
> Unless you have commandline access to the server, it's not hard to get
> into a situation where you can't change the configuration setting
> because all connections are hanging, and you can't even log in to do an
> ALTER SYSTEM etc. You can't kill applications to kill the connection,
> because they will just continue to hang.

Hmm, yes, it's not hard to get into this situation intentionally. But what setup would get you into such trouble? Setting up sync rep, but not configuring an HA tool?

In normal circumstances an HA cluster is not configured to allow this. Normally, hanging commits are part of the failover: somewhere a new primary server is operating. You have to find commandline access to the server to execute pg_rewind and join this node to the cluster again as a standby.

Anyway it's a good idea to set up superuser_reserved_connections for administrative intervention [0].
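
For example (this parameter only takes effect at server start):

ALTER SYSTEM SET superuser_reserved_connections = 5;
-- requires a restart; afterwards five slots stay free for superusers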

I like the idea of transferring transaction locks somewhere until synchronous_commit requirements are satisfied. It brings us closer to making these locks durable enough to survive a restart. But, IMO, the complexity and potentially dangerous conditions easily outweigh the benefits of this approach.

Thanks!

Best regards, Andrey Borodin.

[0]

Re: Avoiding data loss with synchronous replication

From
"Bossart, Nathan"
Date:
On 7/24/21, 3:54 AM, "Andrey Borodin" <x4mmm@yandex-team.ru> wrote:
> Let's clearly enumerate problems of blocking.
> It's been mentioned that the backend is not responsive when cancelation is blocked. But on the contrary, it's very
> responsive.

It is responsive in the sense that it emits a WARNING to the client
whose backend received the request.  However, it is still not
responsive in other aspects.  The backend won't take the requested
action, and if the action was requested via
pg_cancel/terminate_backend(), no useful feedback is provided to the
user to explain why it is blocked.

> The semantics of cancelation assume correct query interruption. That is already impossible once we have committed
> locally; there cannot be any correct cancelation. And I don't think it's worth adding an incorrect cancelation.

The latest version of the block-cancels patch that I've seen still
allows you to cancel things if you really want to.  For example, you
can completely turn off synchronous replication by unsetting
synchronous_standby_names.  Alternatively, you could store the value
of the new block-cancels parameter in shared memory and simply turn
that off to allow cancellations to proceed.  In either case, a user is
forced to change the settings for the whole server.  I think allowing
users to target a specific synchronous replication wait is useful.
Even if I want to block canceling waits for most queries, perhaps I am
okay with unblocking an administrative session that is stuck trying to
update a password (although that could also be achieved by remembering
to set synchronous_commit = local).

What do you think about allowing multiple sets of behavior with the
new parameter?  The "always allow" value would make things work just
like they do today.  The "when specifically requested" value would
allow users to use a new mechanism (perhaps new signals) to
intentionally cancel synchronous replication waits.  And the "always
block" value would disallow canceling such waits without altering the
server-wide settings.
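
Sketched as a GUC, that might look like the following (the parameter
name and values are invented for illustration, not part of any patch):

--   always_allow - works just like today
--   on_request   - only an explicitly flagged request cancels the wait
--   always_block - waits cannot be canceled without changing settings
ALTER SYSTEM SET synchronous_wait_cancellation = 'on_request';
SELECT pg_reload_conf();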

Nathan



Re: Avoiding data loss with synchronous replication

From
"Bossart, Nathan"
Date:
On 7/24/21, 5:25 PM, "Andres Freund" <andres@anarazel.de> wrote:
> First, from the user experience side of things, the issue is that this seems
> to propose violating read-your-own-writes. Within a single connection to a
> single node. Which imo is *far* worse than seeing writes that haven't yet been
> acknowledged as replicated after a query cancel.

Right.  I suspect others will have a similar opinion.

Nathan