Thread: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

Hi,

With synchronous replication, typically all transactions (txns)
first get committed locally, then are streamed to the sync standbys, and
the backend that generated the transaction waits for an ack from the sync
standbys. While waiting for the ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for the ack gets canceled and leaves the txn in an inconsistent
state (the client assumes the txn has been replicated to the
sync standbys) - "The transaction has already committed locally, but
might not have been replicated to the standby.". Upon restart after
a crash, or in the next txn after the old locally committed txn was
canceled, the client will be able to see the txns that weren't
actually streamed to the sync standbys. Also, if the client fails over to
one of the sync standbys after the crash (either by choice or because
of automatic failover management after the crash), the locally committed
txns on the crashed primary would be lost, which isn't good in a true
HA solution.

Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before the backends cancel the
sync replication wait, i.e. delay processing of QueryCancelPending and
ProcDiePending. A new timeout GUC,
synchronous_replication_naptime_before_cancel, is introduced; when set,
it lets the backends wait for the ack before canceling the synchronous
replication wait so that the transaction can be available on the sync
standbys as well. If the ack isn't received even within this time frame,
the backend cancels the wait and goes ahead as it does today. In
production HA environments, the GUC can be set to a reasonable value
to avoid missing transactions during failovers. (A rough sketch of the
idea is shown after this list.)
2) Wait for the sync standbys to catch up upon restart after a crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend that makes the first
connection wait for the sync standbys to catch up in ClientAuthentication()
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.
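
For illustration, the wait loop in SyncRepWaitForLSN() might be adjusted
roughly as below. This is only a sketch of the idea, not the patch itself;
sync_rep_naptime_ms and wait_start are illustrative names for the GUC's
value in milliseconds and the timestamp at which the wait started.

    for (;;)
    {
        /* ... exit as soon as the required standbys have acked our LSN ... */

        if (ProcDiePending || QueryCancelPending)
        {
            /* Give up only once the configured grace period has elapsed. */
            if (sync_rep_naptime_ms == 0 ||
                TimestampDifferenceExceeds(wait_start, GetCurrentTimestamp(),
                                           sync_rep_naptime_ms))
                break;          /* today's behaviour: warn and stop waiting */
            /* otherwise keep waiting; the interrupt stays pending */
        }

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         1000L, WAIT_EVENT_SYNC_REP);
        ResetLatch(MyLatch);
    }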

Thoughts?

Here's a WIP patch implementing (1); I'm yet to code (2). I haven't
added tests; I'm yet to figure out how to add one, as there's no way we
can delay the WAL sender so that we can reliably hit this code. I will
think more about this.

[1] https://www.postgresql.org/message-id/CAHg%2BQDdTdPsqtu0QLG8rMg3Xo%3D6Xo23TwHPYsUgGNEK13wTY5g%40mail.gmail.com

Regards,
Bharath Rupireddy.

On Mon, Apr 25, 2022 at 07:51:03PM +0530, Bharath Rupireddy wrote:
> With synchronous replication typically all the transactions (txns)
> first locally get committed, then streamed to the sync standbys and
> the backend that generated the transaction will wait for ack from sync
> standbys. While waiting for ack, it may happen that the query or the
> txn gets canceled (QueryCancelPending is true) or the waiting backend
> is asked to exit (ProcDiePending is true). In either of these cases,
> the wait for ack gets canceled and leaves the txn in an inconsistent
> state (as in the client thinks that the txn would have replicated to
> sync standbys) - "The transaction has already committed locally, but
> might not have been replicated to the standby.". Upon restart after
> the crash or in the next txn after the old locally committed txn was
> canceled, the client will be able to see the txns that weren't
> actually streamed to sync standbys. Also, if the client fails over to
> one of the sync standbys after the crash (either by choice or because
> of automatic failover management after crash), the locally committed
> txns on the crashed primary would be lost which isn't good in a true
> HA solution.

This topic has come up a few times recently [0] [1] [2].

> Thoughts?

I think this will require a fair amount of discussion.  I'm personally in
favor of just adding a GUC that can be enabled to block canceling
synchronous replication waits, but I know folks have concerns with that
approach.  When I looked at this stuff previously [2], it seemed possible
to handle the other data loss scenarios (e.g., forcing failover whenever
the primary shut down, turning off restart_after_crash).  However, I'm not
wedded to this approach.

[0] https://postgr.es/m/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
[1] https://postgr.es/m/cac4b9df-92c6-77aa-687b-18b86cb13728%40stratox.cz
[2] https://postgr.es/m/FDE157D7-3F35-450D-B927-7EC2F82DB1D6%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
> With synchronous replication typically all the transactions (txns)
> first locally get committed, then streamed to the sync standbys and
> the backend that generated the transaction will wait for ack from sync
> standbys. While waiting for ack, it may happen that the query or the
> txn gets canceled (QueryCancelPending is true) or the waiting backend
> is asked to exit (ProcDiePending is true). In either of these cases,
> the wait for ack gets canceled and leaves the txn in an inconsistent
> state [...]
> 
> Here's a proposal (mentioned previously by Satya [1]) to avoid the
> above problems:
> 1) Wait a configurable amount of time before canceling the sync
> replication by the backends i.e. delay processing of
> QueryCancelPending and ProcDiePending in Introduced a new timeout GUC
> synchronous_replication_naptime_before_cancel, when set, it will let
> the backends wait for the ack before canceling the synchronous
> replication so that the transaction can be available in sync standbys
> as well.
> 2) Wait for sync standbys to catch up upon restart after the crash or
> in the next txn after the old locally committed txn was canceled.

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby.  I think that only using the full
power of two-phase commit can make this bulletproof.

Is it worth adding additional complexity that is not a complete solution?

Yours,
Laurenz Albe





> On 25 Apr 2022, at 21:48, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> I'm personally in
> favor of just adding a GUC that can be enabled to block canceling
> synchronous replication waits

+1. I think it's the only option to provide quorum commit guarantees.

Best regards, Andrey Borodin.


On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
> > With synchronous replication typically all the transactions (txns)
> > first locally get committed, then streamed to the sync standbys and
> > the backend that generated the transaction will wait for ack from sync
> > standbys. While waiting for ack, it may happen that the query or the
> > txn gets canceled (QueryCancelPending is true) or the waiting backend
> > is asked to exit (ProcDiePending is true). In either of these cases,
> > the wait for ack gets canceled and leaves the txn in an inconsistent
> > state [...]
> >
> > Here's a proposal (mentioned previously by Satya [1]) to avoid the
> > above problems:
> > 1) Wait a configurable amount of time before canceling the sync
> > replication by the backends i.e. delay processing of
> > QueryCancelPending and ProcDiePending in Introduced a new timeout GUC
> > synchronous_replication_naptime_before_cancel, when set, it will let
> > the backends wait for the ack before canceling the synchronous
> > replication so that the transaction can be available in sync standbys
> > as well.
> > 2) Wait for sync standbys to catch up upon restart after the crash or
> > in the next txn after the old locally committed txn was canceled.
>
> While this may mitigate the problem, I don't think it will deal with
> all the cases which could cause a transaction to end up committed locally,
> but not on the synchronous standby.  I think that only using the full
> power of two-phase commit can make this bulletproof.

I'm not sure it's recommended to use 2PC in a postgres HA setup with sync
replication, given that the documentation [1] says that "PREPARE TRANSACTION"
and the other 2PC commands are "intended for use by external transaction
management systems" and work with explicit transactions. Whereas the txns
within a postgres HA setup with sync replication don't always have to be
explicit txns. Am I missing something here?

> Is it worth adding additional complexity that is not a complete solution?

The proposed approach helps avoid some common problems that arise in
simple scenarios (like cancelling a long-running query while in
SyncRepWaitForLSN()) with sync replication.

[1] https://www.postgresql.org/docs/devel/sql-prepare-transaction.html

Regards,
Bharath Rupireddy.



On Mon, May 9, 2022 at 2:50 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> >
> > On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
> > > With synchronous replication typically all the transactions (txns)
> > > first locally get committed, then streamed to the sync standbys and
> > > the backend that generated the transaction will wait for ack from sync
> > > standbys. While waiting for ack, it may happen that the query or the
> > > txn gets canceled (QueryCancelPending is true) or the waiting backend
> > > is asked to exit (ProcDiePending is true). In either of these cases,
> > > the wait for ack gets canceled and leaves the txn in an inconsistent
> > > state [...]
> > >
> > > Here's a proposal (mentioned previously by Satya [1]) to avoid the
> > > above problems:
> > > 1) Wait a configurable amount of time before canceling the sync
> > > replication by the backends i.e. delay processing of
> > > QueryCancelPending and ProcDiePending in Introduced a new timeout GUC
> > > synchronous_replication_naptime_before_cancel, when set, it will let
> > > the backends wait for the ack before canceling the synchronous
> > > replication so that the transaction can be available in sync standbys
> > > as well.
> > > 2) Wait for sync standbys to catch up upon restart after the crash or
> > > in the next txn after the old locally committed txn was canceled.
> >
> > While this may mitigate the problem, I don't think it will deal with
> > all the cases which could cause a transaction to end up committed locally,
> > but not on the synchronous standby.  I think that only using the full
> > power of two-phase commit can make this bulletproof.
>
> Not sure if it's recommended to use 2PC in postgres HA with sync
> replication where the documentation says that "PREPARE TRANSACTION"
> and other 2PC commands are "intended for use by external transaction
> management systems" and with explicit transactions. Whereas, the txns
> within a postgres HA with sync replication always don't have to be
> explicit txns. Am I missing something here?
>
> > Is it worth adding additional complexity that is not a complete solution?
>
> The proposed approach helps to avoid some common possible problems
> that arise with simple scenarios (like cancelling a long running query
> while in SyncRepWaitForLSN) within sync replication.

IMHO, making it wait for some amount of time based on a GUC is not a
complete solution.  It is just a hack to avoid the problem in some
cases.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




> On 9 May 2022, at 14:20, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>>
>> While this may mitigate the problem, I don't think it will deal with
>> all the cases which could cause a transaction to end up committed locally,
>> but not on the synchronous standby.  I think that only using the full
>> power of two-phase commit can make this bulletproof.
>
> Not sure if it's recommended to use 2PC in postgres HA with sync
> replication where the documentation says that "PREPARE TRANSACTION"
> and other 2PC commands are "intended for use by external transaction
> management systems" and with explicit transactions. Whereas, the txns
> within a postgres HA with sync replication always don't have to be
> explicit txns. Am I missing something here?

COMMIT PREPARED needs to be replicated as well, thus encountering the very same problem as a usual COMMIT: if done during
failover it can be canceled, and committed data can be wrongfully reported as durably written. 2PC is not a remedy for the
fact that PG silently cancels the wait for sync replication. The problem arises in the presence of any "commit". And a
"commit" is there if transactions are there.
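
(For reference, both phases of 2PC end in the same wait; a greatly
simplified outline of how those paths look in twophase.c, with argument
lists and error handling elided - a sketch, not exact code:

    EndPrepare(gxact):
        ...
        XLogFlush(gxact->prepare_end_lsn);
        SyncRepWaitForLSN(gxact->prepare_end_lsn, false);  /* PREPARE waits */

    RecordTransactionCommitPrepared(xid, ...):
        ...
        XLogFlush(recptr);
        SyncRepWaitForLSN(recptr, true);      /* COMMIT PREPARED waits too */
)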

> On 9 May 2022, at 14:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> IMHO, making it wait for some amount of time, based on GUC is not a
> complete solution.  It is just a hack to avoid the problem in some
> cases.

Disallowing cancelation of locally committed transactions is not a hack. It's the removal of a hack that was erroneously
installed to make the backend responsive to Ctrl+C (or a client-side statement timeout).

> On 26 Apr 2022, at 11:26, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> Is it worth adding additional complexity that is not a complete solution?

It's not additional complexity. It is removing the additional complexity that made sync rep interruptible. (But I'm surely
not talking about GUCs like synchronous_replication_naptime_before_cancel - the wait for sync rep must be indefinite until
synchronous_commit\synchronous_standby_names are satisfied.)

And yes, we need additional complexity - but in some other place. A transaction can also be locally committed in the
presence of a server crash. But this is another difficult problem. A crashed server must not allow data queries until the
LSN of the timeline end is successfully replicated to synchronous_standby_names.

Best regards, Andrey Borodin.


On Mon, May 9, 2022 at 4:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

> > On 9 May 2022, at 14:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > IMHO, making it wait for some amount of time, based on GUC is not a
> > complete solution.  It is just a hack to avoid the problem in some
> > cases.
>
> Disallowing cancelation of locally committed transactions is not a hack. It's removing of a hack that was erroneously
> installed to make backend responsible to Ctrl+C (or client side statement timeout).

I might be missing something, but based on my understanding the
approach is not disallowing query cancellation; it is just adding
configuration for how long to delay before canceling the
query.  That's the reason I mentioned that this is not a guaranteed
solution.  I mean, even with this configuration value you cannot avoid
the problem in all cases, right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Tue, May 10, 2022 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 9, 2022 at 4:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > > On 9 May 2022, at 14:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > IMHO, making it wait for some amount of time, based on GUC is not a
> > > complete solution.  It is just a hack to avoid the problem in some
> > > cases.
> >
> > Disallowing cancelation of locally committed transactions is not a hack. It's removing of a hack that was
> > erroneously installed to make backend responsible to Ctrl+C (or client side statement timeout).
>
> I might be missing something but based on my understanding the
> approach is not disallowing the query cancellation but it is just
> adding the configuration for how much to delay before canceling the
> query.  That's the reason I mentioned that this is not a guarenteed
> solution.  I mean with this configuration value also you can not avoid
> problems in all the cases, right?

Yes Dilip, the GUC proposed in the v1 patch doesn't allow waiting
forever for the sync replication ack; in other words, it doesn't allow
blocking the pending query cancels or proc die interrupts forever.
Otherwise the backends might linger when the ack isn't received or the
sync replicas aren't reachable. Users have to set the GUC to a
'reasonable value'.

If okay, I can make the GUC behave this way: value 0 - existing
behaviour, i.e. no wait for the sync replication ack, just process query
cancels and proc die interrupts immediately; value -1 - wait unboundedly
for the ack; value > 0 - wait the specified number of milliseconds for
the ack.
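
For illustration, the GUC entry might look roughly like the following
(just a sketch in the usual struct config_int table style; the category,
descriptions and limits here are assumptions, not the actual patch):

    {
        {"synchronous_replication_naptime_before_cancel", PGC_USERSET,
            REPLICATION_PRIMARY,
            gettext_noop("Time to keep waiting for the synchronous replication "
                         "acknowledgement before honoring a cancel or die interrupt."),
            gettext_noop("0 cancels the wait immediately (current behaviour), "
                         "-1 waits forever."),
            GUC_UNIT_MS
        },
        &synchronous_replication_naptime_before_cancel,
        0, -1, INT_MAX,
        NULL, NULL, NULL
    },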

Regards,
Bharath Rupireddy.




> On 10 May 2022, at 12:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> If okay, I can make the GUC behave this way - value 0 existing
> behaviour i.e. no wait for sync repl ack, just process query cancels
> and proc die interrupts immediately; value -1 wait unboundedly for the
> ack; value > 0 wait for specified milliseconds for the ack.
+1 if we make -1 and 0 the only valid values.

> query cancels or proc die interrupts

Please note that a typical HA tool would need to handle query cancels and proc die interrupts differently.

When the network is partitioned and a standby is promoted somewhere, you definitely want an infinite wait for cancels. Yet
at some point you will want to shut down postgres without a core dump - thus proc die needs to be processed.

Thanks!

Best regards, Andrey Borodin.


On Tue, May 10, 2022 at 5:55 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 10 May 2022, at 12:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > If okay, I can make the GUC behave this way - value 0 existing
> > behaviour i.e. no wait for sync repl ack, just process query cancels
> > and proc die interrupts immediately; value -1 wait unboundedly for the
> > ack; value > 0 wait for specified milliseconds for the ack.
> +1 if we make -1 and 0 only valid values.
>
> > query cancels or proc die interrupts
>
> Please note, that typical HA tool would need to handle query cancels and proc die interrupts differently.

Agree.

> When the network is partitioned and somewhere standby is promoted you definitely want infinite wait for cancels.

When a standby is promoted, there is no point in the old primary waiting
forever for the ack, assuming we are going to discard it.

> Yet once upon a time you want to shutdown postgres without coredump - thus proc die needs to be processed.

I think users can still have the flexibility to set the right amount
of time for processing cancel and proc die interrupts.

IMHO, the GUC can still accept 0, -1 and values > 0 in milliseconds; let
the users decide what they want. Do you see any problems with this?

Thoughts?

Regards,
Bharath Rupireddy.



On Tue, May 10, 2022 at 5:55 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 10 May 2022, at 12:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > If okay, I can make the GUC behave this way - value 0 existing
> > behaviour i.e. no wait for sync repl ack, just process query cancels
> > and proc die interrupts immediately; value -1 wait unboundedly for the
> > ack; value > 0 wait for specified milliseconds for the ack.
> +1 if we make -1 and 0 only valid values.
>
> > query cancels or proc die interrupts
>
> Please note, that typical HA tool would need to handle query cancels and proc die interrupts differently.

Hm, after thinking for a while, I tend to agree with the above
approach - meaning, query cancel interrupt processing can be disabled
entirely in SyncRepWaitForLSN() while proc die interrupts are processed
immediately. This approach requires no GUC, as opposed to the proposed
v1 patch upthread.
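
Roughly, the wait loop would then look something like the sketch below
(only an illustration of the idea; the exit check and the error/warning
messages of the real SyncRepWaitForLSN() are elided):

    for (;;)
    {
        /* ... exit once the required standbys have acked the commit LSN ... */

        if (ProcDiePending)
        {
            /* fast shutdown / pg_terminate_backend(): give up the wait */
            whereToSendOutput = DestNone;
            SyncRepCancelWait();
            break;
        }

        /*
         * Deliberately do NOT check QueryCancelPending here: a cancel
         * request stays pending and is serviced only after the commit has
         * been acknowledged by the synchronous standbys.
         */

        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
                         WAIT_EVENT_SYNC_REP);
        ResetLatch(MyLatch);
    }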

However, it's good to see what other hackers think about this.

> When the network is partitioned and somewhere standby is promoted you definitely want infinite wait for cancels. Yet
> once upon a time you want to shutdown postgres without coredump - thus proc die needs to be processed.

Agree.

On Mon, May 9, 2022 at 4:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 26 Apr 2022, at 11:26, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> >
> > Is it worth adding additional complexity that is not a complete solution?
>
> Its not additional complexity. It is removing additional complexity that made sync rep interruptible. (But I'm surely
> talking not about GUCs like synchronous_replication_naptime_before_cancel - wait of sync rep must be indefinite until
> synchrous_commit\synchronous_standby_names are satisfied )
>
> And yes, we need additional complexity - but in some other place. Transaction can also be locally committed in
> presence of a server crash. But this another difficult problem. Crashed server must not allow data queries until LSN of
> timeline end is successfully replicated to synchronous_standby_names.

Hm, that needs to be done anyway. How about doing it as proposed
initially upthread [1]? Also, quoting the idea here [2].

Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] 2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.

Regards,
Bharath Rupireddy.




> On 25 Jul 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hm, after thinking for a while, I tend to agree with the above
> approach - meaning, query cancel interrupt processing can completely
> be disabled in SyncRepWaitForLSN() and process proc die interrupt
> immediately, this approach requires no GUC as opposed to the proposed
> v1 patch upthread.
A GUC was proposed here [0] to maintain compatibility with the previous behaviour. But I think that having no GUC here is
fine too. If we do not allow cancelation of unreplicated backends, of course.


>>
>> And yes, we need additional complexity - but in some other place. Transaction can also be locally committed in
>> presence of a server crash. But this another difficult problem. Crashed server must not allow data queries until LSN of
>> timeline end is successfully replicated to synchronous_standby_names.
>
> Hm, that needs to be done anyways. How about doing as proposed
> initially upthread [1]? Also, quoting the idea here [2].
>
> Thoughts?
>
> [1] https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
> [2] 2) Wait for sync standbys to catch up upon restart after the crash or
> in the next txn after the old locally committed txn was canceled. One
> way to achieve this is to let the backend, that's making the first
> connection, wait for sync standbys to catch up in ClientAuthentication
> right after successful authentication. However, I'm not sure this is
> the best way to do it at this point.


I think ideally the startup process should not allow read-only connections in CheckRecoveryConsistency() until the WAL is
replicated to the quorum, at least up to the new timeline's LSN.

Thanks!

[0] https://commitfest.postgresql.org/34/2402/




On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 25 Jul 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Hm, after thinking for a while, I tend to agree with the above
> > approach - meaning, query cancel interrupt processing can completely
> > be disabled in SyncRepWaitForLSN() and process proc die interrupt
> > immediately, this approach requires no GUC as opposed to the proposed
> > v1 patch upthread.
> GUC was proposed here[0] to maintain compatibility with previous behaviour. But I think that having no GUC here is
> fine too. If we do not allow cancelation of unreplicated backends, of course.
>
> >>
> >> And yes, we need additional complexity - but in some other place. Transaction can also be locally committed in
> >> presence of a server crash. But this another difficult problem. Crashed server must not allow data queries until LSN of
> >> timeline end is successfully replicated to synchronous_standby_names.
> >
> > Hm, that needs to be done anyways. How about doing as proposed
> > initially upthread [1]? Also, quoting the idea here [2].
> >
> > Thoughts?
> >
> > [1] https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
> > [2] 2) Wait for sync standbys to catch up upon restart after the crash or
> > in the next txn after the old locally committed txn was canceled. One
> > way to achieve this is to let the backend, that's making the first
> > connection, wait for sync standbys to catch up in ClientAuthentication
> > right after successful authentication. However, I'm not sure this is
> > the best way to do it at this point.
>
>
> I think ideally startup process should not allow read only connections in CheckRecoveryConsistency() until WAL is not
> replicated to quorum at least up until new timeline LSN.

We can't do it in CheckRecoveryConsistency() unless I'm missing
something, because the walsenders (required for sending the remaining
WAL to the sync standbys to achieve quorum) can only be started after the
server reaches a consistent state; after all, walsenders are
specialized backends.



--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/



On Thu, Aug 4, 2022 at 1:42 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > > On 25 Jul 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > Hm, after thinking for a while, I tend to agree with the above
> > > approach - meaning, query cancel interrupt processing can completely
> > > be disabled in SyncRepWaitForLSN() and process proc die interrupt
> > > immediately, this approach requires no GUC as opposed to the proposed
> > > v1 patch upthread.
> > GUC was proposed here[0] to maintain compatibility with previous behaviour. But I think that having no GUC here is
> > fine too. If we do not allow cancelation of unreplicated backends, of course.
> >
> > >>
> > >> And yes, we need additional complexity - but in some other place. Transaction can also be locally committed in
> > >> presence of a server crash. But this another difficult problem. Crashed server must not allow data queries until LSN of
> > >> timeline end is successfully replicated to synchronous_standby_names.
> > >
> > > Hm, that needs to be done anyways. How about doing as proposed
> > > initially upthread [1]? Also, quoting the idea here [2].
> > >
> > > Thoughts?
> > >
> > > [1] https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
> > > [2] 2) Wait for sync standbys to catch up upon restart after the crash or
> > > in the next txn after the old locally committed txn was canceled. One
> > > way to achieve this is to let the backend, that's making the first
> > > connection, wait for sync standbys to catch up in ClientAuthentication
> > > right after successful authentication. However, I'm not sure this is
> > > the best way to do it at this point.
> >
> >
> > I think ideally startup process should not allow read only connections in CheckRecoveryConsistency() until WAL is
> > not replicated to quorum at least up until new timeline LSN.
>
> We can't do it in CheckRecoveryConsistency() unless I'm missing
> something. Because, the walsenders (required for sending the remaining
> WAL to sync standbys to achieve quorum) can only be started after the
> server reaches a consistent state, after all walsenders are
> specialized backends.

Continuing the above thought (I inadvertently clicked the send
button previously): a simple approach would be to check for quorum in
PostgresMain(), before entering the query loop for (;;), for the
non-walsender cases. A disadvantage is that, in the worst case, all the
backends will be waiting here if achieving the sync quorum takes time
after the restart. Roughly, we can do the following in PostgresMain();
of course we need a locking mechanism so that all the backends that
reach here wait for the same LSN:

/* pseudocode; runs once per backend before the main query loop */
if (sync_replication_defined &&
    shmem->wait_for_sync_repl_upon_restart)
{
      /* block until the sync standbys confirm everything flushed so far */
      SyncRepWaitForLSN(pg_current_wal_flush_lsn(), false);
      shmem->wait_for_sync_repl_upon_restart = false;
}

Thoughts?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/



At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in 
> While this may mitigate the problem, I don't think it will deal with
> all the cases which could cause a transaction to end up committed locally,
> but not on the synchronous standby.  I think that only using the full
> power of two-phase commit can make this bulletproof.
> 
> Is it worth adding additional complexity that is not a complete solution?

I would agree with this. As with 2PC, whatever we do to make it
perfect, we're left with unresolvable problems, at least for now.

Doesn't it meet your requirements if we have a means to know whether the
last transaction on the current session was locally committed or aborted?

We are already internally managing the last committed LSN. I think we can
do the same thing for transaction aborts and the last inserted LSN, and we
can expose them in some way. This is way simpler than the (maybe)
impossible attempt to fill up the deep gap.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Fri, Aug 5, 2022 at 8:19 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in
> > While this may mitigate the problem, I don't think it will deal with
> > all the cases which could cause a transaction to end up committed locally,
> > but not on the synchronous standby.  I think that only using the full
> > power of two-phase commit can make this bulletproof.
> >
> > Is it worth adding additional complexity that is not a complete solution?
>
> I would agree to this. Likewise 2PC, whatever we do to make it
> perfect, we're left with unresolvable problems at least for now.
>
> Doesn't it meet your requirements if we have a means to know the last
> transaction on the current session is locally committed or aborted?
>
> We are already internally managing last committed LSN. I think we can
> do the same thing about transaction abort and last inserted LSN and we
> can expose them any way. This is way simpler than the (maybe)
> uncompletable attempt to fill up the deep gap.

There can be more txns that are
locally-committed-but-not-yet-replicated. Even if we have that
information stored somewhere, what do we do with it? Those txns are
committed from the client perspective but not committed from the
server's perspective.

Can you please explain more about your idea? I may be missing something.

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/



At Mon, 8 Aug 2022 19:13:25 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> On Fri, Aug 5, 2022 at 8:19 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in
> > > While this may mitigate the problem, I don't think it will deal with
> > > all the cases which could cause a transaction to end up committed locally,
> > > but not on the synchronous standby.  I think that only using the full
> > > power of two-phase commit can make this bulletproof.
> > >
> > > Is it worth adding additional complexity that is not a complete solution?
> >
> > I would agree to this. Likewise 2PC, whatever we do to make it
> > perfect, we're left with unresolvable problems at least for now.
> >
> > Doesn't it meet your requirements if we have a means to know the last
> > transaction on the current session is locally committed or aborted?
> >
> > We are already internally managing last committed LSN. I think we can
> > do the same thing about transaction abort and last inserted LSN and we
> > can expose them any way. This is way simpler than the (maybe)
> > uncompletable attempt to fill up the deep gap.
> 
> There can be more txns that are
> locally-committed-but-not-yet-replicated. Even if we have that
> information stored somewhere, what do we do with it? Those txns are
> committed from the client perspective but not committed from the
> server's perspective.
> 
> Can you please explain more about your idea, I may be missing something?

(I'm not sure I understand the requirements here..)

I understand that it is about query cancellation.  In the case of a
primary crash/termination, the client cannot even know whether the commit
of the ongoing transaction, if any, has been recorded.  There is no way
other than to somehow confirm, after restart, that the change made by the
transaction is actually there.  I believe that is the standard
practice for applications that work on HA clusters.

The same is true in the case of query cancellation, since the commit
response doesn't reach the client either.  But even in this case, if we
had functions/views that tell us the
last-committed/last-aborted/last-inserted LSN of a session, we could
know whether the last transaction has been committed, along with the
commit LSN, perhaps more easily.

# In fact, I see those functions rather as a means to know whether a
# change by the last transaction on a session is available on some
# replica.

For example, the heavily simplified pseudo code below might show
how the functions (if they were functions) would work.

  try {
    s.execute("INSERT ..");
    c.commit();
  } catch (Exception x) {
    c.commit();
    if (s.execute("SELECT pg_session_last_committed_lsn() = "
                  "pg_session_last_inserted_lsn()"))
    {
      /* the transaction has been locally committed */
      if (s.execute("SELECT replay_lsn >= pg_session_last_committed_lsn() "
                    "FROM pg_stat_replication where xxxx"))
        /* the commit has been replicated to xxx, LSN is known */
    } else {
      /* the transaction has not been locally committed */
      <retry?>
    }
  }


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Tue, Aug 9, 2022 at 12:42 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> > Can you please explain more about your idea, I may be missing something?
>
> (I'm not sure I understand the requirements here..)

I've explained the problem with the current HA setup with synchronous
replication upthread at [1]. Let me reiterate it here once again.

When a query is cancelled (a simple stroke of CTRL+C or a
pg_cancel_backend() call) while the txn is waiting for the ack in
SyncRepWaitForLSN(), then for the client the txn is committed
(locally-committed-but-not-yet-replicated to all of the sync standbys)
like any other txn; a warning is emitted into the server logs, but it is
of no use to the client (think of the client as an application). There
can be many such txns waiting for the ack in SyncRepWaitForLSN(), and a
query cancel can be issued on all of those sessions. The problem is that
subsequent reads will then be able to see all of those
locally-committed-but-not-yet-replicated-to-all-sync-standbys txns'
data - this is what I call an inconsistency (can we call this a
read-after-write inconsistency?) caused by the lack of proper query
cancel handling. And if the sync standbys are down or unable to come
up for some reason, until then the primary will be serving clients
with this inconsistent data. BTW, I found a report of this problem here
[2].

The solution proposed for the above problem is to just 'not honor
query cancels at all while waiting for ack in SyncRepWaitForLSN()'.

When a proc die is pending, then also, there can be
locally-committed-but-not-yet-replicated to all of sync standbys txns.
Typically, there are two choices for the clients 1) reuse the primary
instance after restart 2) failover to one of sync standbys. For case
(1), there might be read-after-write inconsistency as explained above.
For case (2), those txns might get lost completely if the failover
target sync standby or the new primary didn't receive them and the
other sync standbys that have received them are now ahead and need a
special treatment (run pg_rewind) for them to be able to connect to
new primary.

The solution proposed for case (1) of the above problem is to 'process
the ProcDiePending immediately and upon restart the first backend can
wait until the sync standbys are caught up to ensure no inconsistent
reads'.
The solution proposed for case (2) of the above problem is to 'either
run pg_rewind for the sync standbys that are ahead or use the idea
proposed at [3]'.

I hope the above explanation helps.

[1]
https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
[2] https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
[3] https://www.postgresql.org/message-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw%40mail.gmail.com

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/



On Tue, Aug 9, 2022 at 2:16 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I've explained the problem with the current HA setup with synchronous
> replication upthread at [1]. Let me reiterate it here once again.
>
> When a query is cancelled (a simple stroke of CTRL+C or
> pg_cancel_backend() call) while the txn is waiting for ack in
> SyncRepWaitForLSN(); for the client, the txn is actually committed
> (locally-committed-but-not-yet-replicated to all of sync standbys)
> like any other txn, a warning is emitted into server logs but it is of
> no use for the client (think of client as applications). There can be
> many such txns waiting for ack in SyncRepWaitForLSN() and query cancel
> can be issued on all of those sessions. The problem is that the
> subsequent reads will then be able to read all of those
> locally-committed-but-not-yet-replicated to all of sync standbys txns
> data - this is what I call an inconsistency (can we call this a
> read-after-write inconsistency?) because of lack of proper query
> cancel handling. And if the sync standbys are down or unable to come
> up for some reason, until then, the primary will be serving clients
> with the inconsistent data. BTW, I found a report of this problem here
> [2].
>
> The solution proposed for the above problem is to just 'not honor
> query cancels at all while waiting for ack in SyncRepWaitForLSN()'.
>
> When a proc die is pending, then also, there can be
> locally-committed-but-not-yet-replicated to all of sync standbys txns.
> Typically, there are two choices for the clients 1) reuse the primary
> instance after restart 2) failover to one of sync standbys. For case
> (1), there might be read-after-write inconsistency as explained above.
> For case (2), those txns might get lost completely if the failover
> target sync standby or the new primary didn't receive them and the
> other sync standbys that have received them are now ahead and need a
> special treatment (run pg_rewind) for them to be able to connect to
> new primary.
>
> The solution proposed for case (1) of the above problem is to 'process
> the ProcDiePending immediately and upon restart the first backend can
> wait until the sync standbys are caught up to ensure no inconsistent
> reads'.
> The solution proposed for case (2) of the above problem is to 'either
> run pg_rewind for the sync standbys that are ahead or use the idea
> proposed at [3]'.
>
> I hope the above explanation helps.
>
> [1]
https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
> [2] https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
> [3] https://www.postgresql.org/message-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw%40mail.gmail.com

I'm attaching the v2 patch rebased on the latest HEAD. Please note
that there are still some open points; I've yet to find time to think
more about them. Meanwhile, I'm posting the v2 patch to keep the cfbot
happy. Any further thoughts on the overall design of the patch are
most welcome. Thanks.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

On Tue, Sep 27, 2022 at 06:52:21PM +0530, Bharath Rupireddy wrote:
> On Tue, Aug 9, 2022 at 2:16 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > I've explained the problem with the current HA setup with synchronous
> > replication upthread at [1]. Let me reiterate it here once again.
> >
> > When a query is cancelled (a simple stroke of CTRL+C or
> > pg_cancel_backend() call) while the txn is waiting for ack in
> > SyncRepWaitForLSN(); for the client, the txn is actually committed
> > (locally-committed-but-not-yet-replicated to all of sync standbys)
> > like any other txn, a warning is emitted into server logs but it is of
> > no use for the client (think of client as applications). There can be
> > many such txns waiting for ack in SyncRepWaitForLSN() and query cancel
> > can be issued on all of those sessions. The problem is that the
> > subsequent reads will then be able to read all of those
> > locally-committed-but-not-yet-replicated to all of sync standbys txns
> > data - this is what I call an inconsistency (can we call this a
> > read-after-write inconsistency?) because of lack of proper query
> > cancel handling. And if the sync standbys are down or unable to come
> > up for some reason, until then, the primary will be serving clients
> > with the inconsistent data. BTW, I found a report of this problem here
> > [2].
> >
> > The solution proposed for the above problem is to just 'not honor
> > query cancels at all while waiting for ack in SyncRepWaitForLSN()'.
> >
> > When a proc die is pending, then also, there can be
> > locally-committed-but-not-yet-replicated to all of sync standbys txns.
> > Typically, there are two choices for the clients 1) reuse the primary
> > instance after restart 2) failover to one of sync standbys. For case
> > (1), there might be read-after-write inconsistency as explained above.
> > For case (2), those txns might get lost completely if the failover
> > target sync standby or the new primary didn't receive them and the
> > other sync standbys that have received them are now ahead and need a
> > special treatment (run pg_rewind) for them to be able to connect to
> > new primary.
> >
> > The solution proposed for case (1) of the above problem is to 'process
> > the ProcDiePending immediately and upon restart the first backend can
> > wait until the sync standbys are caught up to ensure no inconsistent
> > reads'.
> > The solution proposed for case (2) of the above problem is to 'either
> > run pg_rewind for the sync standbys that are ahead or use the idea
> > proposed at [3]'.
> >
> > I hope the above explanation helps.
> >
> > [1]
https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
> > [2]
https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
> > [3] https://www.postgresql.org/message-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw%40mail.gmail.com
> 
> I'm attaching the v2 patch rebased on the latest HEAD. Please note
> that there are still some open points, I'm yet to find time to think
> more about them. Meanwhile, I'm posting the v2 patch for making cfbot
> happy. Any further thoughts on the overall design of the patch are
> most welcome. Thanks.
... 
> In PostgreSQL high availability setup with synchronous replication,
> typically all the transactions first locally get committed, then
> streamed to the synchronous standbys and the backends that generated
> the transaction will wait for acknowledgement from synchronous
> standbys. While waiting for acknowledgement, it may happen that the
> query or the transaction gets canceled or the backend that's waiting
> for acknowledgement is asked to exit. In either of these cases, the
> wait for acknowledgement gets canceled and leaves transaction in an
> inconsistent state as it got committed locally but not on the
> standbys. When set the GUC synchronous_replication_naptime_before_cancel
> introduced in this patch, it will let the backends wait for the
> acknowledgement before canceling the wait for acknowledgement so
> that the transaction can be available in synchronous standbys as
> well.

I don't think this patch is going in the right direction, and I think we
need to step back to see why.

First, let's see how synchronous replication works.  Each backend waits
for one or more synchronous replicas to acknowledge the WAL related to
its commit; it then marks the commit as done in PGPROC and then reports
it to the client.  I wrote a blog about it:

    https://momjian.us/main/blogs/pgblog/2020.html#June_3_2020
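
For reference, a greatly simplified outline of that commit sequence (the
real code lives in RecordTransactionCommit() in xact.c; argument lists
and error handling are elided here):

    recptr = XactLogCommitRecord(...);      /* emit the commit WAL record */
    XLogFlush(recptr);                      /* flush it to local pg_wal */
    TransactionIdCommitTree(xid, nchildren, children);  /* set pg_xact status */
    SyncRepWaitForLSN(recptr, true);        /* block until sync standbys ack */
    /* only after this does the backend report success to the client */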

So, what happens when an insufficient number of synchronous replicas
reply?  Sessions hang because the synchronous behavior cannot be
guaranteed.  We then _allow_ query cancel so the user or administrator
can get out of the hung sessions and perhaps modify
synchronous_standby_names.

What the proposed patch effectively does is prevent the ability to
recover the hung sessions or auto-continue the sessions if an
insufficient number of synchronous replicas respond.  So, in a way, it
is allowing for more strict and less strict behavior of
synchronous_standby_names.

However, I think trying to solve this at the session level is the wrong
approach.  If you set a timeout to continue stuck sessions, odds are the
timeout will be too long for each session and performance will be
unacceptable anyway, so you haven't gained much.  If you prevent
cancels, you effectively lock up the system with fewer options of
recovery.

I have always felt this has to be done at the server level, meaning when
a synchronous_standby_names replica is not responding after a certain
timeout, the administrator must be notified by calling a shell command
defined in a GUC and all sessions will ignore the replica.  This gives a
much more predictable and useful behavior than the one in the patch ---
we have discussed this approach many times on the email lists.

Once we have that, we can consider removing the cancel ability while
waiting for synchronous replicas (since we have the timeout) or making it
optional.  We can also consider how to notify the administrator during
query cancel (if we allow it) and backend abrupt exit/crash, and whether
we should allow users to specify a retry interval to resynchronize the
synchronous replicas.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




On Fri, Sep 30, 2022 at 4:23 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> I don't think this patch is going in the right direction, and I think we
> need to step back to see why.

Thanks a lot Bruce for providing insights.

> First, let's see how synchronous replication works.  Each backend waits
> for one or more synchronous replicas to acknowledge the WAL related to
> its commit and then it marks the commit as done in PGPROC and then to
> the client;  I wrote a blog about it:
>
>         https://momjian.us/main/blogs/pgblog/2020.html#June_3_2020

Great!

> So, what happens when an insufficient number of synchronous replicas
> reply?  Sessions hang because the synchronous behavior cannot be
> guaranteed.  We then _allow_ query cancel so the user or administrator
> can get out of the hung sessions and perhaps modify
> synchronous_standby_names.

Right.

> What the proposed patch effectively does is to prevent the ability to
> recovery the hung sessions or auto-continue the sessions if an
> insufficient number of synchronous replicas respond.  So, in a way, it
> is allowing for more strict and less strict behavior of
> synchronous_standby_names.

Yes. I do agree that it makes the sessions wait further and closes the
quick exit path that admins/users have when the problem occurs. But it
prevents users from cancelling queries or terminating backends only to
end up with locally-committed-but-not-replicated transactions on a server
setup where the sync standbys are all working fine. I agree that we don't
want to wait further on cancels or proc dies, as it makes the system
more unresponsive [1].

> However, I think trying to solve this at the session level is the wrong
> approach.  If you set a timeout to continue stuck sessions, odds are the
> timeout will be too long for each session and performance will be
> unacceptable anyway, so you haven't gained much.  If you prevent
> cancels, you effectively lock up the system with fewer options of
> recovery.

Yes.

> I have always felt this has to be done at the server level, meaning when
> a synchronous_standby_names replica is not responding after a certain
> timeout, the administrator must be notified by calling a shell command
> defined in a GUC and all sessions will ignore the replica.  This gives a
> much more predictable and useful behavior than the one in the patch ---
> we have discussed this approach many times on the email lists.

IIUC, each walsender serving a sync standby will determine that the
sync standby isn't responding for a configurable amount of time (less
than wal_sender_timeout) and will call the shell command to notify the
admin if there are any backends waiting for sync replication in
SyncRepWaitForLSN(). The shell command then provides, at a bare minimum,
the unresponsive sync standby's name so the admin can ignore it as a
sync standby/remove it from synchronous_standby_names to continue
further. This still requires manual intervention, which is a problem when
running postgres server instances at scale. Also, having a new shell
command in any form may pose security risks. I'm not sure at this
point how this new timeout is going to work alongside
wal_sender_timeout.

I'm thinking about the possible options that an admin has to get out
of this situation:
1) Removing the standby from synchronous_standby_names.
2) Fixing the sync standby, by restarting or restoring the lost part
(such as network or some other).

(1) is something that postgres can help admins get out of the problem
easily and automatically without any intervention. (2) is something
postgres can't do much about.

How about we let postgres automatically remove an unresponsive (for a
pre-configured time) sync standby from synchronous_standby_names and
inform the user (via a log message, a new walsender property, and
pg_stat_replication for monitoring purposes)? The users can then
detect such standbys and later try to bring them back into the sync
standbys group or do other things. I believe that a production-level
postgres HA setup with sync standbys will have monitoring to detect
replication lag, make failover decisions, etc. via
pg_stat_replication. With this approach, a bit more monitoring is
needed. This solution requires little or no manual intervention and
scales well. Please note that I haven't studied the possibilities of
implementing it yet.

Thoughts?

> Once we have that, we can consider removing the cancel ability while
> waiting for synchronous replicas (since we have the timeout) or make it
> optional.  We can also consider how do notify the administrator during
> query cancel (if we allow it), backend abrupt exit/crash, and

Yeah. If we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
the users can then choose to disable processing query cancels/proc
dies while waiting for sync replication in SyncRepWaitForLSN().

> if we
> should allow users to specify a retry interval to resynchronize the
> synchronous replicas.

This is another interesting thing to consider if we were to make the
auto-removed (by the above approach) standby a sync standby again
without manual intervention.

Thoughts?

[1] https://www.postgresql.org/message-id/CA%2BTgmoaCBwgMDkeBDOgtPgHcbfSYq%2BzORjL5DoU3pJyjALxtoQ%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Sat, Oct  1, 2022 at 06:59:26AM +0530, Bharath Rupireddy wrote:
> > I have always felt this has to be done at the server level, meaning when
> > a synchronous_standby_names replica is not responding after a certain
> > timeout, the administrator must be notified by calling a shell command
> > defined in a GUC and all sessions will ignore the replica.  This gives a
                         ------------------------------------
> > much more predictable and useful behavior than the one in the patch ---
> > we have discussed this approach many times on the email lists.
> 
> IIUC, each walsender serving a sync standby will determine that the
> sync standby isn't responding for a configurable amount of time (less
> than wal_sender_timeout) and calls shell command to notify the admin
> if there are any backends waiting for sync replication in
> SyncRepWaitForLSN(). The shell command then provides the unresponsive
> sync standby name at the bare minimum for the admin to ignore it as
> sync standby/remove it from synchronous_standby_names to continue
> further. This still requires manual intervention which is a problem if
> running postgres server instances at scale. Also, having a new shell

As I highlighted above, by default you notify the administrator that a
synchronous replica is not responding and then ignore it.  If it becomes
responsive again, you notify the administrator again and add it back as
a synchronous replica.

> command in any form may pose security risks. I'm not sure at this
> point how this new timeout is going to work alongside
> wal_sender_timeout.

We have archive_command, so I don't see a problem with another shell
command.

> I'm thinking about the possible options that an admin has to get out
> of this situation:
> 1) Removing the standby from synchronous_standby_names.

Yes, see above.  We might need a read-only GUC that reports which
synchronous replicas are active.  As you can see, there is a lot of API
design required here, but this is the most effective approach.

> 2) Fixing the sync standby, by restarting or restoring the lost part
> (such as network or some other).
> 
> (1) is something that postgres can help admins get out of the problem
> easily and automatically without any intervention. (2) is something
> postgres can't do much about.
> 
> How about we let postgres automatically remove an unresponsive (for a
> pre-configured time) sync standby from synchronous_standby_names and
> inform the user (via log message and via new walsender property and
> pg_stat_replication for monitoring purposes)? The users can then
> detect such standbys and later try to bring them back to the sync
> standbys group or do other things. I believe that a production level
> postgres HA with sync standbys will have monitoring to detect the
> replication lag, failover decision etc via monitoring
> pg_stat_replication. With this approach, a bit more monitoring is
> needed. This solution requires less or no manual intervention and
> scales well. Please note that I haven't studied the possibilities of
> implementing it yet.
> 
> Thoughts?

Yes, see above.

> > Once we have that, we can consider removing the cancel ability while
> > waiting for synchronous replicas (since we have the timeout) or make it
> > optional.  We can also consider how do notify the administrator during
> > query cancel (if we allow it), backend abrupt exit/crash, and
> 
> Yeah. If we have the
> timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> the users can then choose to disable processing query cancels/proc
> dies while waiting for sync replication in SyncRepWaitForLSN().

Yes.  We might also change things so a query cancel that happens during
synchronous replica waiting can only be done by an administrator, not the
session owner.  Again, lots of design needed here.

> > if we
> > should allow users to specify a retry interval to resynchronize the
> > synchronous replicas.
> 
> This is another interesting thing to consider if we were to make the
> auto-removed (by the above approach) standby a sync standby again
> without manual intervention.

Yes, see above.  You are addressing the right questions here.  :-)

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




On Thu, Oct 6, 2022 at 2:30 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> As I highlighted above, by default you notify the administrator that a
> sychronous replica is not responding and then ignore it.  If it becomes
> responsive again, you notify the administrator again and add it back as
> a sychronous replica.
>
> > command in any form may pose security risks. I'm not sure at this
> > point how this new timeout is going to work alongside
> > wal_sender_timeout.
>
> We have archive_command, so I don't see a problem with another shell
> command.

Why do we need a new command to inform the admin/user about a sync
standby being ignored (removed from the sync quorum) for not responding
or acknowledging within a certain amount of time in SyncRepWaitForLSN()?
Can't we just add an extra column, or use the existing sync_state
column, in pg_stat_replication? We can either introduce a new state
such as temporary_async or just use the existing state 'potential' [1].
A downside is that the server has to be monitored for this extra, new
state. If we do this, we don't need another command to report.
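
Just to illustrate the shape of it, something along these lines where
the view's sync_state column is computed (the auto_demoted flag, the
SYNC_STATE_COL index and the "temporary_async" label are all made up
for illustration; nothing like this exists today):

    /*
     * Hypothetical sketch only: where pg_stat_replication's sync_state
     * is computed, report a standby that was auto-removed from the sync
     * quorum with a new value.  "auto_demoted" is an assumed new flag on
     * the walsender's shared state; "is_sync_standby" stands for the
     * existing check of whether this walsender is in the sync set.
     */
    if (auto_demoted)
        values[SYNC_STATE_COL] = CStringGetTextDatum("temporary_async");
    else if (priority == 0)
        values[SYNC_STATE_COL] = CStringGetTextDatum("async");
    else if (is_sync_standby)
        values[SYNC_STATE_COL] = CStringGetTextDatum("sync");
    else
        values[SYNC_STATE_COL] = CStringGetTextDatum("potential");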

> > I'm thinking about the possible options that an admin has to get out
> > of this situation:
> > 1) Removing the standby from synchronous_standby_names.
>
> Yes, see above.  We might need a read-only GUC that reports which
> sychronous replicas are active.  As you can see, there is a lot of API
> design required here, but this is the most effective approach.

If we use the above approach of reporting via pg_stat_replication, we
don't need this.

> > > Once we have that, we can consider removing the cancel ability while
> > > waiting for synchronous replicas (since we have the timeout) or make it
> > > optional.  We can also consider how do notify the administrator during
> > > query cancel (if we allow it), backend abrupt exit/crash, and
> >
> > Yeah. If we have the
> > timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> > the users can then choose to disable processing query cancels/proc
> > dies while waiting for sync replication in SyncRepWaitForLSN().
>
> Yes.  We might also change things so a query cancel that happens during
> sychronous replica waiting can only be done by an administrator, not the
> session owner.  Again, lots of design needed here.

Yes, we need infrastructure to track who issued the query cancel or
proc die and so on. IMO, allowing or disallowing query cancels (or
CTRL+C) based on role type - superusers, users with the replication
role, or members of one of the predefined roles - isn't a good
approach.

In general, it is the walsender serving the sync standby that has to
mark itself as an async standby by removing itself from
synchronous_standby_names, reloading the config variables and waking up
the backends that are waiting in the syncrep wait queue for it to
update its LSN.

And, the new auto removal timeout should always be set to less than
wal_sender_timeout.
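
To make this concrete, here is a rough sketch of what the
walsender-side check could look like (sync_standby_auto_demote_timeout,
last_reply and MarkMySyncStandbyAsAsync() are invented names; locking,
the other wait queues and the actual synchronous_standby_names update
are hand-waved away):

    /*
     * Hypothetical sketch, e.g. called from the walsender main loop.
     * last_reply is when we last received an ack from this standby;
     * sync_standby_auto_demote_timeout is an assumed new GUC in ms,
     * expected to be set lower than wal_sender_timeout.
     */
    static void
    CheckSyncStandbyResponsiveness(TimestampTz last_reply)
    {
        if (sync_standby_auto_demote_timeout <= 0)
            return;

        if (!TimestampDifferenceExceeds(last_reply, GetCurrentTimestamp(),
                                        sync_standby_auto_demote_timeout))
            return;

        ereport(LOG,
                (errmsg("synchronous standby \"%s\" is not responding, treating it as asynchronous",
                        application_name)));

        /* invented helper: drop ourselves from the sync quorum */
        MarkMySyncStandbyAsAsync();

        /*
         * Release the backends waiting for this standby; only the flush
         * queue is shown, and the required SyncRepLock is omitted.
         */
        SyncRepWakeQueue(true, SYNC_REP_WAIT_FLUSH);
    }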

All that said, imagine we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution in
one form or another, with the auto-removal timeout set to 5 minutes.
Either of the following can happen:

1) A query is stuck waiting for the sync standby ack in
SyncRepWaitForLSN(), no query cancel or proc die interrupt arrives, and
the sync standby is made an async standby after the timeout, i.e. after
5 minutes.
2) A query is stuck waiting for the sync standby ack in
SyncRepWaitForLSN() for, say, about 3 minutes, and then a query cancel
or proc die interrupt arrives. Should we process it immediately, or
wait for the timeout to happen (2 more minutes) and only then process
the interrupt? If we process the interrupts immediately, then the
locally-committed-but-not-replicated-to-sync-standby problems described
upthread [2] are left unresolved.

[1] https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-VIEW
sync_state text
Synchronous state of this standby server. Possible values are:
async: This standby server is asynchronous.
potential: This standby server is now asynchronous, but can
potentially become synchronous if one of current synchronous ones
fails.
sync: This standby server is synchronous.
quorum: This standby server is considered as a candidate for quorum standbys.

[2] https://www.postgresql.org/message-id/CALj2ACXmMWtpmuT-%3Dv8F%2BLk4QCbdkeN%2ByHKXeRGKFfjG96YbKA%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Thu, Oct  6, 2022 at 01:33:33PM +0530, Bharath Rupireddy wrote:
> On Thu, Oct 6, 2022 at 2:30 AM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > As I highlighted above, by default you notify the administrator that a
> > sychronous replica is not responding and then ignore it.  If it becomes
> > responsive again, you notify the administrator again and add it back as
> > a sychronous replica.
> >
> > > command in any form may pose security risks. I'm not sure at this
> > > point how this new timeout is going to work alongside
> > > wal_sender_timeout.
> >
> > We have archive_command, so I don't see a problem with another shell
> > command.
> 
> Why do we need a new command to inform the admin/user about a sync
> replication being ignored (from sync quorum) for not responding or
> acknowledging for a certain amount of time in SyncRepWaitForLSN()?
> Can't we just add an extra column or use existing sync_state in
> pg_stat_replication()? We can either introduce a new state such as
> temporary_async or just use the existing state 'potential' [1]. A
> problem is that the server has to be monitored for this extra, new
> state. If we do this, we don't need another command to report.

Yes, that is a good point.  I assumed people would want notification
immediately rather than waiting for monitoring to notice it.  Consider
monitoring every five seconds: if the primary loses sync and goes down
during that five-second interval, there would be no way to know whether
sync was dropped and transactions were acknowledged as committed to the
client before the primary went down.  I would love to just rely on
monitoring, but I am not sure that is sufficient for this use case.

Of course, if email is being sent it might still be in the email queue
when the primary goes down, but I guess if I were doing it I would make
sure the email was delivered _before_ returning.  The point is that we
would not disable the sync and acknowledge the commit to the client
until the notification command returns success --- that kind of
guarantee is hard to get with monitoring.
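
To make the ordering concrete, here is a rough sketch of what I have in
mind (sync_standby_notify_command and MarkSyncStandbyAsAsync() are
made-up names, and a real implementation would reuse the existing
shell-command machinery rather than a bare system() call):

    /*
     * Hypothetical sketch: demote the standby and release the waiting
     * backends only after the notification command reports success.
     */
    if (sync_standby_notify_command != NULL &&
        sync_standby_notify_command[0] != '\0')
    {
        int     rc = system(sync_standby_notify_command);

        if (rc != 0)        /* real code would decode the exit status */
        {
            ereport(WARNING,
                    (errmsg("synchronous standby notification command failed, keeping standby synchronous")));
            return;         /* keep waiting, do not demote */
        }
    }

    /* notification delivered (or not configured): safe to demote now */
    MarkSyncStandbyAsAsync();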

These are good discussions to have --- maybe I am wrong.

> > > > Once we have that, we can consider removing the cancel ability while
> > > > waiting for synchronous replicas (since we have the timeout) or make it
> > > > optional.  We can also consider how do notify the administrator during
> > > > query cancel (if we allow it), backend abrupt exit/crash, and
> > >
> > > Yeah. If we have the
> > > timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> > > the users can then choose to disable processing query cancels/proc
> > > dies while waiting for sync replication in SyncRepWaitForLSN().
> >
> > Yes.  We might also change things so a query cancel that happens during
> > sychronous replica waiting can only be done by an administrator, not the
> > session owner.  Again, lots of design needed here.
> 
> Yes, we need infrastructure to track who issued the query cancel or
> proc die and so on. IMO, it's not a good way to allow/disallow query
> cancels or CTRL+C based on role types - superusers or users with
> replication roles or users who are members of any of predefined roles.
> 
> In general, it is the walsender serving sync standby that has to mark
> itself as async standby by removing itself from
> synchronous_standby_names, reloading config variables and waking up
> the backends that are waiting in syncrep wait queue for it to update
> LSN.
> 
> And, the new auto removal timeout should always be set to less than
> wal_sender_timeout.
> 
> All that said, imagine we have
> timeout-and-auto-removal-of-standby-from-sync-standbys-list solution
> in one or the other forms with auto removal timeout set to 5 minutes,
> any of following can happen:
> 
> 1) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
> no query cancel or proc die interrupt is arrived, the sync standby is
> made as async standy after the timeout i.e. 5 minutes.
> 2) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
> say for about 3 minutes, then query cancel or proc die interrupt is
> arrived, should we immediately process it or wait for timeout to
> happen (2 more minutes) and then process the interrupt? If we
> immediately process the interrupts, then the
> locally-committed-but-not-replicated-to-sync-standby problems
> described upthread [2] are left unresolved.

I have a feeling that once we have the timeout, we would disable query
cancel at this stage, since it would be canceling an already-committed
query.  The timeout would cancel the sync, but at least the
administrator would know.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




On Thu, Sep 29, 2022 at 3:53 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> So, what happens when an insufficient number of synchronous replicas
> reply?

It's a failover.

> Sessions hang because the synchronous behavior cannot be
> guaranteed.  We then _allow_ query cancel so the user or administrator
> can get out of the hung sessions and perhaps modify
> synchronous_standby_names.

Administrators should not modify synchronous_standby_names.
The administrator must shoot this node in the head instead.

> I have always felt this has to be done at the server level, meaning when
> a synchronous_standby_names replica is not responding after a certain
> timeout, the administrator must be notified by calling a shell command
> defined in a GUC and all sessions will ignore the replica.

Standbys are expelled from the wait list according to quorum rules. I'd
propose not to invent more quorum rules involving shell scripts.
The administrator has already expressed how many standbys can be
offline by setting synchronous_standby_names, and by doing so they
actively asked for hanging queries in case of insufficient standbys.

We have reserved administrator connections for the case when all
connection slots are used by hanging queries.


Best regards, Andrey Borodin.



On Tue, Nov 8, 2022 at 9:06 PM Andrey Borodin <amborodin86@gmail.com> wrote:
>
> On Thu, Sep 29, 2022 at 3:53 PM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > So, what happens when an insufficient number of synchronous replicas
> > reply?
>
> It's a failover.
>
> > Sessions hang because the synchronous behavior cannot be
> > guaranteed.  We then _allow_ query cancel so the user or administrator
> > can get out of the hung sessions and perhaps modify
> > synchronous_standby_names.
>
> Administrators should not modify synchronous_standby_names.
> Administrator must shoot this not in the head.
>

Some funny stuff. If a user tries to cancel a non-replicated transaction
Azure Postgres will answer: "user requested cancel while waiting for
synchronous replication ack. The COMMIT record has already flushed to
WAL locally and might not have been replicatead to the standby. We
must wait here."
AWS RDS will answer: "ignoring request to cancel wait for synchronous
replication"
Yandex Managed Postgres will answer: "canceling wait for synchronous
replication due requested, but cancelation is not allowed. The
transaction has already committed locally and might not have been
replicated to the standby. We must wait here."

So, for many services providing Postgres as a service it's only a
matter of wording.

Best regards, Andrey Borodin.



On Mon, Nov 28, 2022 at 12:57 AM Andrey Borodin <amborodin86@gmail.com> wrote:
>
> Some funny stuff. If a user tries to cancel a non-replicated transaction
> Azure Postgres will answer: "user requested cancel while waiting for
> synchronous replication ack. The COMMIT record has already flushed to
> WAL locally and might not have been replicatead to the standby. We
> must wait here."
> AWS RDS will answer: "ignoring request to cancel wait for synchronous
> replication"
> Yandex Managed Postgres will answer: "canceling wait for synchronous
> replication due requested, but cancelation is not allowed. The
> transaction has already committed locally and might not have been
> replicated to the standby. We must wait here."
>
> So, for many services providing Postgres as a service it's only a
> matter of wording.

Thanks for verifying the behaviour. And many thanks for an off-list chat.

FWIW, I'm planning to prepare a patch as per the idea below, which is
similar to the initial proposal in this thread. Meanwhile, thoughts are
welcome.

1. Disable query cancel/CTRL+C/SIGINT when a backend is waiting for
sync replication acknowledgement (a rough sketch of this follows below).
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today; however, upon restart,
don't open up for business (don't accept read-only connections) unless
the sync standbys have caught up.
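
Here is a minimal sketch of (1), assuming the change lives in the wait
loop of SyncRepWaitForLSN() in syncrep.c; the message wording and exact
placement are illustrative only, not the actual patch:

    /*
     * Hypothetical sketch inside SyncRepWaitForLSN()'s wait loop:
     * keep waiting on query cancel instead of bailing out, but keep
     * the proc-die handling as it is today.
     */
    if (QueryCancelPending)
    {
        QueryCancelPending = false;
        ereport(WARNING,
                (errmsg("ignoring request to cancel wait for synchronous replication"),
                 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
        /* note: no SyncRepCancelWait()/break here, we keep waiting */
    }

    /* ProcDiePending handling stays unchanged from today */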

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Sun, Nov 27, 2022 at 11:26:50AM -0800, Andrey Borodin wrote:
> Some funny stuff. If a user tries to cancel a non-replicated transaction
> Azure Postgres will answer: "user requested cancel while waiting for
> synchronous replication ack. The COMMIT record has already flushed to
> WAL locally and might not have been replicatead to the standby. We
> must wait here."
> AWS RDS will answer: "ignoring request to cancel wait for synchronous
> replication"
> Yandex Managed Postgres will answer: "canceling wait for synchronous
> replication due requested, but cancelation is not allowed. The
> transaction has already committed locally and might not have been
> replicated to the standby. We must wait here."
> 
> So, for many services providing Postgres as a service it's only a
> matter of wording.

Wow, you are telling me all three cloud vendors changed how query cancel
behaves on an unresponsive synchronous replica?  That is certainly a
testament that the community needs to change or at least review our
behavior.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.



On Mon, Nov 28, 2022 at 12:03:06PM +0530, Bharath Rupireddy wrote:
> Thanks for verifying the behaviour. And many thanks for an off-list chat.
> 
> FWIW, I'm planning to prepare a patch as per the below idea which is
> something similar to the initial proposal in this thread. Meanwhile,
> thoughts are welcome.
> 
> 1. Disable query cancel/CTRL+C/SIGINT when a backend is waiting for
> sync replication acknowledgement.
> 2. Process proc die immediately when a backend is waiting for sync
> replication acknowledgement, as it does today, however, upon restart,
> don't open up for business (don't accept ready-only connections)
> unless the sync standbys have caught up.

You can prepare a patch, but it is unlikely to get much interest until
you get agreement on what the behavior should be.  The optimal order of
developer actions is:

    Desirability -> Design -> Implement -> Test -> Review -> Commit
    https://wiki.postgresql.org/wiki/Todo#Development_Process

Telling us what other cloud vendors do is not sufficient.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.



On Mon, Nov 28, 2022 at 12:59 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> You can prepare a patch, but it unlikely to get much interest until you
> get agreement on what the behavior should be.

We discussed the approach at the 2020 Unconference [0], and there was
more or less an agreement.
Then I gave a presentation at FOSDEM with all the details [1].
The patch had been in the commitfest since 2019 [2]. There were
reviewers on the CF entry, and again we more or less had an agreement.
Jeff Davis proposed a similar patch [3]. And we certainly agreed about
cancels.
And now Bharath is proposing the same.

We have the interest and agreement.


Best regards, Andrey Borodin.

[0]
https://wiki.postgresql.org/wiki/PgCon_2020_Developer_Unconference/Edge_cases_of_synchronous_replication_in_HA_solutions
[1]
https://archive.fosdem.org/2021/schedule/event/postgresql_caveats_of_replication/attachments/slides/4365/export/events/attachments/postgresql_caveats_of_replication/slides/4365/sides.pdf
[2] https://commitfest.postgresql.org/26/2402/
[3]
https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.camel%40j-davis.com#415dc2f7d41b8a251b419256407bb64d



On Mon, Nov 28, 2022 at 01:31:39PM -0800, Andrey Borodin wrote:
> On Mon, Nov 28, 2022 at 12:59 PM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > You can prepare a patch, but it unlikely to get much interest until you
> > get agreement on what the behavior should be.
> 
> We discussed the approach on 2020's Unconference [0]. And there kind
> of was an agreement.
> Then I made a presentation on FOSDEM with all the details [1].
> The patch had been on commitfest since 2019 [2]. There were reviewers
> in the CF entry, and we kind of had an agreement.
> Jeff Davis proposed a similar patch [3]. And we certainly agreed about cancels.
> And now Bharath is proposing the same.
> 
> We have the interest and agreement.

Okay, I was not aware we had such broad agreement.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.





On Sun, Nov 27, 2022 at 10:33 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Nov 28, 2022 at 12:57 AM Andrey Borodin <amborodin86@gmail.com> wrote:
>
> Some funny stuff. If a user tries to cancel a non-replicated transaction
> Azure Postgres will answer: "user requested cancel while waiting for
> synchronous replication ack. The COMMIT record has already flushed to
> WAL locally and might not have been replicatead to the standby. We
> must wait here."
> AWS RDS will answer: "ignoring request to cancel wait for synchronous
> replication"
> Yandex Managed Postgres will answer: "canceling wait for synchronous
> replication due requested, but cancelation is not allowed. The
> transaction has already committed locally and might not have been
> replicated to the standby. We must wait here."
>
> So, for many services providing Postgres as a service it's only a
> matter of wording.

Thanks for verifying the behaviour. And many thanks for an off-list chat.

FWIW, I'm planning to prepare a patch as per the below idea which is
something similar to the initial proposal in this thread. Meanwhile,
thoughts are welcome.

1. Disable query cancel/CTRL+C/SIGINT when a backend is waiting for
sync replication acknowledgement.

+1
 
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept ready-only connections)
unless the sync standbys have caught up.

Are you planning to block connections or queries to the database? It would be good to allow connections and let them query the monitoring views, but block the queries until the sync standbys have caught up. Otherwise, this leaves a monitoring hole. In the cloud, I presume superusers are allowed to connect and monitor (end customers are not members of that role and can't query the data), but the same may not be true for all installations. Could you please add more details on your approach?
 

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
>     2. Process proc die immediately when a backend is waiting for sync
>     replication acknowledgement, as it does today, however, upon restart,
>     don't open up for business (don't accept ready-only connections)
>     unless the sync standbys have caught up.
> 
> 
> Are you planning to block connections or queries to the database? It would be
> good to allow connections and let them query the monitoring views but block the
> queries until sync standby have caught up. Otherwise, this leaves a monitoring
> hole. In cloud, I presume superusers are allowed to connect and monitor (end
> customers are not the role members and can't query the data). The same can't be
> true for all the installations. Could you please add more details on your
> approach?

I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.





On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
>     2. Process proc die immediately when a backend is waiting for sync
>     replication acknowledgement, as it does today, however, upon restart,
>     don't open up for business (don't accept ready-only connections)
>     unless the sync standbys have caught up.
>
>
> Are you planning to block connections or queries to the database? It would be
> good to allow connections and let them query the monitoring views but block the
> queries until sync standby have caught up. Otherwise, this leaves a monitoring
> hole. In cloud, I presume superusers are allowed to connect and monitor (end
> customers are not the role members and can't query the data). The same can't be
> true for all the installations. Could you please add more details on your
> approach?

I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?

Yes, a change to synchronous_standby_names is expected in this situation. IMHO, blocking all connections is not a recommended approach.



--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.


On Tue, Nov 29, 2022 at 8:42 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:


On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
>     2. Process proc die immediately when a backend is waiting for sync
>     replication acknowledgement, as it does today, however, upon restart,
>     don't open up for business (don't accept ready-only connections)
>     unless the sync standbys have caught up.
>
>
> Are you planning to block connections or queries to the database? It would be
> good to allow connections and let them query the monitoring views but block the
> queries until sync standby have caught up. Otherwise, this leaves a monitoring
> hole. In cloud, I presume superusers are allowed to connect and monitor (end
> customers are not the role members and can't query the data). The same can't be
> true for all the installations. Could you please add more details on your
> approach?

I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?

Yes, Change in synchronous_standby_names is expected in this situation. IMHO, blocking all the connections is not a recommended approach.

How about allowing superusers (they can still read locally committed data) and members of the pg_monitor role?




--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
> >     2. Process proc die immediately when a backend is waiting for sync
> >     replication acknowledgement, as it does today, however, upon restart,
> >     don't open up for business (don't accept ready-only connections)
> >     unless the sync standbys have caught up.
> >
> >
> > Are you planning to block connections or queries to the database? It would be
> > good to allow connections and let them query the monitoring views but block the
> > queries until sync standby have caught up. Otherwise, this leaves a monitoring
> > hole. In cloud, I presume superusers are allowed to connect and monitor (end
> > customers are not the role members and can't query the data). The same can't be
> > true for all the installations. Could you please add more details on your
> > approach?
>
> I think ALTER SYSTEM should be allowed, particularly so you can modify
> synchronous_standby_names, no?

We don't allow SQL access during crash recovery until it has caught up
to the consistency point. And that's for a reason - the cluster may
have an invalid system catalog.
So no, after a crash without a quorum of standbys you could only change
auto.conf and send SIGHUP. Accessing the system catalog during crash
recovery is another, unrelated problem.

But I'd propose to treat these two points differently; they pose
drastically different levels of danger. Query cancels are issued here
and there during failovers/switchovers. A crash amidst a network
partition is not that common.

Best regards, Andrey Borodin.





On Tue, Nov 29, 2022 at 10:52 AM Andrey Borodin <amborodin86@gmail.com> wrote:
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
> >     2. Process proc die immediately when a backend is waiting for sync
> >     replication acknowledgement, as it does today, however, upon restart,
> >     don't open up for business (don't accept ready-only connections)
> >     unless the sync standbys have caught up.
> >
> >
> > Are you planning to block connections or queries to the database? It would be
> > good to allow connections and let them query the monitoring views but block the
> > queries until sync standby have caught up. Otherwise, this leaves a monitoring
> > hole. In cloud, I presume superusers are allowed to connect and monitor (end
> > customers are not the role members and can't query the data). The same can't be
> > true for all the installations. Could you please add more details on your
> > approach?
>
> I think ALTER SYSTEM should be allowed, particularly so you can modify
> synchronous_standby_names, no?

We don't allow SQL access during crash recovery until it's caught up
to consistency point. And that's for a reason - the cluster may have
invalid system catalog.
So no, after crash without a quorum of standbys you can only change
auto.conf and send SIGHUP. Accessing the system catalog during crash
recovery is another unrelated problem.

In the crash recovery case the catalog may be inconsistent, but in this case the cluster is consistent; it just has locally committed changes that haven't been acknowledged by the remote standbys. Accepting a superuser connection does no harm. The auth checks performed are still valid after the standbys have fully caught up. I don't see a reason why superuser / pg_monitor connections need to be blocked.


But I'd propose to treat these two points differently, they possess
drastically different scales of danger. Query Cancels are issued here
and there during failovers\switchovers. Crash amidst network
partitioning is not that common.

Supportability and operability matter even more in corner cases, in order to troubleshoot an issue quickly.
 

Best regards, Andrey Borodin.


On Tue, Nov 29, 2022 at 11:20 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:


On Tue, Nov 29, 2022 at 10:52 AM Andrey Borodin <amborodin86@gmail.com> wrote:
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
> >     2. Process proc die immediately when a backend is waiting for sync
> >     replication acknowledgement, as it does today, however, upon restart,
> >     don't open up for business (don't accept ready-only connections)
> >     unless the sync standbys have caught up.
> >
> >
> > Are you planning to block connections or queries to the database? It would be
> > good to allow connections and let them query the monitoring views but block the
> > queries until sync standby have caught up. Otherwise, this leaves a monitoring
> > hole. In cloud, I presume superusers are allowed to connect and monitor (end
> > customers are not the role members and can't query the data). The same can't be
> > true for all the installations. Could you please add more details on your
> > approach?
>
> I think ALTER SYSTEM should be allowed, particularly so you can modify
> synchronous_standby_names, no?

We don't allow SQL access during crash recovery until it's caught up
to consistency point. And that's for a reason - the cluster may have
invalid system catalog.
So no, after crash without a quorum of standbys you can only change
auto.conf and send SIGHUP. Accessing the system catalog during crash
recovery is another unrelated problem.

In the crash recovery case, catalog is inconsistent but in this case, the cluster has remote uncommitted changes (consistent). Accepting a superuser connection is no harm. The auth checks performed are still valid after standbys fully caught up. I don't see a reason why superuser / pg_monitor connections are required to be blocked.

If blocking only queries is harder, and superusers are not allowed to connect since they could read locally committed data that hasn't been replicated yet, how about adding a new role that can only update and reload the server configuration?


But I'd propose to treat these two points differently, they possess
drastically different scales of danger. Query Cancels are issued here
and there during failovers\switchovers. Crash amidst network
partitioning is not that common.

Supportability and operability are more important in corner cases to quickly troubleshoot an issue,
 

Best regards, Andrey Borodin.
On Tue, Nov 29, 2022 at 10:45 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
>
> On Tue, Nov 29, 2022 at 8:42 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>>
>> On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
>>>
>>> On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
>>> >     2. Process proc die immediately when a backend is waiting for sync
>>> >     replication acknowledgement, as it does today, however, upon restart,
>>> >     don't open up for business (don't accept ready-only connections)
>>> >     unless the sync standbys have caught up.
>>> >
>>> > Are you planning to block connections or queries to the database? It would be
>>> > good to allow connections and let them query the monitoring views but block the
>>> > queries until sync standby have caught up. Otherwise, this leaves a monitoring
>>> > hole. In cloud, I presume superusers are allowed to connect and monitor (end
>>> > customers are not the role members and can't query the data). The same can't be
>>> > true for all the installations. Could you please add more details on your
>>> > approach?
>>>
>>> I think ALTER SYSTEM should be allowed, particularly so you can modify
>>> synchronous_standby_names, no?
>>
>> Yes, Change in synchronous_standby_names is expected in this situation. IMHO, blocking all the connections is not a
>> recommended approach.
 
>
> How about allowing superusers (they can still read locally committed data) and users part of pg_monitor role?

I started to spend time on this feature again. Thanks all for your
comments so far.

Per the latest comments, it looks like we're mostly okay with emitting
a warning and ignoring query cancel interrupts while waiting for the
sync replication ACK.

For proc die, it looks like the suggestion was to process it
immediately and, upon the next restart, not allow user connections
until all sync standbys have caught up. However, we need to allow
replication connections from the standbys so that they can stream the
needed WAL and catch up with the primary, and we need to allow
superusers or members of the pg_monitor role to connect, to run ALTER
SYSTEM (to remove any unresponsive sync standbys from the list or to
disable sync replication altogether) or to monitor the flush LSN /
catch-up status. All other connections would be blocked. Note that
replication, superuser and pg_monitor connections are allowed only
after the server reaches a consistent state, not before, so that no
inconsistent data can be read.
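
To illustrate the gating, a rough sketch (WaitingForSyncCatchup and the
error wording are invented; the exact hook point during connection
startup, after consistency is reached, needs careful design):

    /*
     * Hypothetical sketch: refuse ordinary client connections until the
     * sync standbys have caught up after a restart.  WaitingForSyncCatchup
     * is an assumed flag set during startup; ROLE_PG_MONITOR is the
     * pg_monitor role's OID macro in recent releases.
     */
    if (WaitingForSyncCatchup &&
        !am_walsender &&
        !superuser() &&
        !is_member_of_role(GetUserId(), ROLE_PG_MONITOR))
        ereport(FATAL,
                (errcode(ERRCODE_CANNOT_CONNECT_NOW),
                 errmsg("the database system is waiting for synchronous standbys to catch up")));

Superusers could then still run ALTER SYSTEM and pg_reload_conf() to
drop the unresponsive standbys from synchronous_standby_names.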

The trickiest part of doing the above is how we detect, upon restart,
that the server received a proc die while waiting for the sync
replication ACK. One idea might be to set a flag in the control file
before the crash (i.e. when entering the wait). A second idea might be
to write a marker file (although I don't favor this idea); its presence
would indicate that the server was waiting for the sync replication ACK
before the crash. However, with either of these two ideas we may not be
able to detect all kinds of crashes that can happen while a backend is
waiting for the sync replication ACK, so this may not be a complete
solution.
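
For the control-file idea, a very rough sketch (syncRepWaitPending is
an invented ControlFileData field, the updates would have to go through
xlog.c, and doing this around every commit wait is likely too expensive
as-is, so this only shows the mechanism):

    /* Hypothetical sketch: mark the wait in the control file ... */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->syncRepWaitPending = true;     /* invented field */
    UpdateControlFile();
    LWLockRelease(ControlFileLock);

    /* ... wait for the standby ack in SyncRepWaitForLSN() ... */

    /* ... and clear it once the ack has arrived. */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->syncRepWaitPending = false;
    UpdateControlFile();
    LWLockRelease(ControlFileLock);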

A third idea might be to just let the primary wait for the sync
standbys to catch up upon restart, irrespective of whether it crashed
while waiting for a sync replication ACK. While this idea works without
having to detect all kinds of crashes, the primary may not come up if
any unresponsive standbys are present (currently, the primary continues
to be operational, at least for read-only queries, irrespective of
whether the sync standbys have caught up or not).

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



> On 30 Jan 2023, at 06:55, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

> I started to spend time on this feature again. Thanks all for your
> comments so far.

Since there haven't been any updates for the past six months, and the
patch hasn't applied cleanly for a few months, I am marking this
returned with feedback for now.  Please feel free to open a new entry
in a future CF for this patch when there is a new version.

--
Daniel Gustafsson