Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Date
Msg-id CALj2ACUiyE1ui4Daqw1cw-tLTSqkiB1bYZ6rWb5BE70-C65Cog@mail.gmail.com
In response to Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Thu, Aug 4, 2022 at 1:42 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > > On 25 July 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > Hm, after thinking for a while, I tend to agree with the above
> > > approach - meaning, query cancel interrupt processing can completely
> > > be disabled in SyncRepWaitForLSN() and process proc die interrupt
> > > immediately, this approach requires no GUC as opposed to the proposed
> > > v1 patch upthread.
> > GUC was proposed here[0] to maintain compatibility with previous behaviour. But I think that having no GUC here is fine too. If we do not allow cancelation of unreplicated backends, of course.
> >
> > >>
> > >> And yes, we need additional complexity - but in some other place. Transaction can also be locally committed in presence of a server crash. But this is another difficult problem. Crashed server must not allow data queries until LSN of timeline end is successfully replicated to synchronous_standby_names.
> > >
> > > Hm, that needs to be done anyways. How about doing as proposed
> > > initially upthread [1]? Also, quoting the idea here [2].
> > >
> > > Thoughts?
> > >
> > > [1] https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
> > > [2] 2) Wait for sync standbys to catch up upon restart after the crash or
> > > in the next txn after the old locally committed txn was canceled. One
> > > way to achieve this is to let the backend, that's making the first
> > > connection, wait for sync standbys to catch up in ClientAuthentication
> > > right after successful authentication. However, I'm not sure this is
> > > the best way to do it at this point.
> >
> >
> > I think ideally startup process should not allow read only connections in CheckRecoveryConsistency() until WAL is replicated to quorum at least up until new timeline LSN.
>
> We can't do it in CheckRecoveryConsistency() unless I'm missing
> something. Because, the walsenders (required for sending the remaining
> WAL to sync standbys to achieve quorum) can only be started after the
> server reaches a consistent state, after all walsenders are
> specialized backends.

Continuing the above thought (I inadvertently hit the send button
earlier): a simple approach would be to check for quorum in
PostgresMain(), for non-walsender backends, before entering the
for (;;) query loop. The disadvantage is that, in the worst case, every
backend waits there if reaching the sync quorum takes a long time after
restart. Roughly, we could do the following in PostgresMain(); of
course, we need a locking mechanism so that all backends reaching this
point wait for the same LSN:

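/* Pseudocode: the flag names and the shmem struct are placeholders. */
if (sync_replication_defined &&
    shmem->wait_for_sync_repl_upon_restart)
{
    /* Block until the sync standbys confirm WAL up to the current flush LSN. */
    SyncRepWaitForLSN(pg_current_wal_flush_lsn(), false);
    shmem->wait_for_sync_repl_upon_restart = false;
}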
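
To make the "same LSN" requirement concrete, here is a minimal
single-threaded model of the shared state, not PostgreSQL code. The
struct RestartSyncState and the restart_sync_* functions are
hypothetical names, XLogRecPtr is only a stand-in typedef, and a real
implementation would have to hold a lock around each call. The first
backend to arrive pins the target LSN; later backends reuse it instead
of sampling a newer flush LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* stand-in for PostgreSQL's type */
#define INVALID_LSN ((XLogRecPtr) 0)

/* Hypothetical shared-memory state; callers are assumed to hold a lock. */
typedef struct RestartSyncState
{
    bool        wait_needed;    /* true after restart if sync standbys are configured */
    XLogRecPtr  target_lsn;     /* LSN every backend must wait for; pinned once */
} RestartSyncState;

/* Called once at startup. */
void
restart_sync_init(RestartSyncState *st, bool sync_standbys_defined)
{
    st->wait_needed = sync_standbys_defined;
    st->target_lsn = INVALID_LSN;
}

/*
 * Returns the LSN the calling backend must wait for, or INVALID_LSN if no
 * wait is needed. Only the first caller's flush LSN is recorded.
 */
XLogRecPtr
restart_sync_target(RestartSyncState *st, XLogRecPtr flush_lsn)
{
    assert(flush_lsn != INVALID_LSN);

    if (!st->wait_needed)
        return INVALID_LSN;
    if (st->target_lsn == INVALID_LSN)
        st->target_lsn = flush_lsn;
    return st->target_lsn;
}

/*
 * Returns true once the standbys have confirmed the pinned target, and
 * clears the flag so that later backends skip the wait entirely.
 */
bool
restart_sync_may_proceed(RestartSyncState *st, XLogRecPtr replicated_lsn)
{
    if (!st->wait_needed)
        return true;
    if (st->target_lsn == INVALID_LSN)
        return false;           /* no target pinned yet */
    if (replicated_lsn >= st->target_lsn)
    {
        st->wait_needed = false;
        return true;
    }
    return false;
}
```

In this model a backend would call restart_sync_target() with its
current flush LSN, then block until restart_sync_may_proceed() returns
true for the LSN confirmed by the sync standbys. Pinning the target once
keeps the wait bounded even while new WAL is being flushed.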

Thoughts?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/


