Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication |
Date | |
Msg-id | Yz73ZM46ciZDEZUG@momjian.us |
In response to | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
List | pgsql-hackers |
On Thu, Oct 6, 2022 at 01:33:33PM +0530, Bharath Rupireddy wrote:
> On Thu, Oct 6, 2022 at 2:30 AM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > As I highlighted above, by default you notify the administrator that a
> > synchronous replica is not responding and then ignore it. If it becomes
> > responsive again, you notify the administrator again and add it back as
> > a synchronous replica.
> >
> > > command in any form may pose security risks. I'm not sure at this
> > > point how this new timeout is going to work alongside
> > > wal_sender_timeout.
> >
> > We have archive_command, so I don't see a problem with another shell
> > command.
>
> Why do we need a new command to inform the admin/user about a sync
> replication being ignored (from sync quorum) for not responding or
> acknowledging for a certain amount of time in SyncRepWaitForLSN()?
> Can't we just add an extra column or use existing sync_state in
> pg_stat_replication()? We can either introduce a new state such as
> temporary_async or just use the existing state 'potential' [1]. A
> problem is that the server has to be monitored for this extra, new
> state. If we do this, we don't need another command to report.

Yes, that is a good point. I assumed people would want notification
immediately rather than waiting for monitoring to notice it. Consider
if you monitor every five seconds but the primary loses sync and goes
down during that five-second interval --- there would be no way to know
if sync stopped and reported committed transactions to the client before
the primary went down. I would love to just rely on monitoring but I
am not sure that is sufficient for this use-case.

Of course, if email is being sent it might be still in the email queue
when the primary goes down, but I guess if I was doing it I would make
sure the email was delivered _before_ returning.
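To make the ordering concrete, here is a minimal sketch in Python. It is
not PostgreSQL code: finish_commit_after_sync_timeout and notify_cmd are
hypothetical names, and no such setting exists in PostgreSQL today. It
only illustrates the idea that the notification command has to exit
successfully before the standby is dropped from the synchronous set and
the commit is acknowledged.

```python
import subprocess
import sys


def finish_commit_after_sync_timeout(notify_cmd, sync_standbys, standby):
    """Hypothetical handling of a synchronous standby that stopped responding.

    notify_cmd    -- argv list of an administrator notification command (assumed)
    sync_standbys -- mutable set of standby names currently treated as synchronous
    standby       -- name of the standby that timed out

    Returns True if the commit may now be acknowledged to the client.
    """
    # Step 1: run the notification command and wait for it to finish.
    result = subprocess.run(notify_cmd + [standby])
    if result.returncode != 0:
        # Notification failed: leave the standby synchronous and keep waiting,
        # so no commit is acknowledged without the administrator being told.
        return False
    # Step 2: only after a successful notification, stop waiting on the standby.
    sync_standbys.discard(standby)
    # Step 3: the caller may acknowledge the commit.
    return True


if __name__ == "__main__":
    standbys = {"standby1"}
    # Stand-in commands for illustration: one exits 1, the other exits 0.
    bad_cmd = [sys.executable, "-c", "raise SystemExit(1)"]
    ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
    print(finish_commit_after_sync_timeout(bad_cmd, standbys, "standby1"), standbys)
    print(finish_commit_after_sync_timeout(ok_cmd, standbys, "standby1"), standbys)
```

A polling monitor cannot give the same guarantee, because nothing in the
commit path waits for the monitor to have seen the state change.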
The point is that we would not disable the sync and acknowledge the
commit to the client until the notification command returns success
--- that kind of guarantee is hard to do with monitoring.

These are good discussions to have --- maybe I am wrong.

> > > > Once we have that, we can consider removing the cancel ability while
> > > > waiting for synchronous replicas (since we have the timeout) or make it
> > > > optional. We can also consider how to notify the administrator during
> > > > query cancel (if we allow it), backend abrupt exit/crash, and
> > >
> > > Yeah. If we have the
> > > timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> > > the users can then choose to disable processing query cancels/proc
> > > dies while waiting for sync replication in SyncRepWaitForLSN().
> >
> > Yes. We might also change things so a query cancel that happens during
> > synchronous replica waiting can only be done by an administrator, not the
> > session owner. Again, lots of design needed here.
>
> Yes, we need infrastructure to track who issued the query cancel or
> proc die and so on. IMO, it's not a good way to allow/disallow query
> cancels or CTRL+C based on role types - superusers or users with
> replication roles or users who are members of any of predefined roles.
>
> In general, it is the walsender serving sync standby that has to mark
> itself as async standby by removing itself from
> synchronous_standby_names, reloading config variables and waking up
> the backends that are waiting in syncrep wait queue for it to update
> LSN.
>
> And, the new auto removal timeout should always be set to less than
> wal_sender_timeout.
>
> All that said, imagine we have
> timeout-and-auto-removal-of-standby-from-sync-standbys-list solution
> in one or the other forms with auto removal timeout set to 5 minutes,
> any of following can happen:
>
> 1) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
> no query cancel or proc die interrupt has arrived, the sync standby is
> made an async standby after the timeout, i.e. 5 minutes.
> 2) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
> say for about 3 minutes, then a query cancel or proc die interrupt
> arrives, should we immediately process it or wait for timeout to
> happen (2 more minutes) and then process the interrupt? If we
> immediately process the interrupts, then the
> locally-committed-but-not-replicated-to-sync-standby problems
> described upthread [2] are left unresolved.

I have a feeling once we have the timeout, we would disable query cancel
when we are in this stage since it is canceling a committed query. The
timeout would cancel the sync but at least the administrator would know.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson
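The two quoted scenarios, and the suggestion to disable query cancel
during the wait, can be sketched as a small simulation. This is an
illustrative Python model, not PostgreSQL code: wait_for_sync_ack,
removal_timeout, and allow_cancel are hypothetical names, and time is
modeled as plain seconds since the local commit.

```python
def wait_for_sync_ack(ack_at, cancel_at, removal_timeout, allow_cancel):
    """Model a backend waiting for a synchronous standby acknowledgment.

    ack_at, cancel_at -- seconds after local commit, or None if never
    removal_timeout   -- seconds until the standby is auto-removed (assumed)
    allow_cancel      -- whether a cancel/proc die is processed while waiting

    Returns (seconds_waited, outcome).
    """
    events = []
    if ack_at is not None:
        events.append((ack_at, "replicated"))
    if cancel_at is not None and allow_cancel:
        # Processing the interrupt returns control to the client while the
        # transaction is committed locally but not on the standby.
        events.append((cancel_at, "cancelled_unreplicated"))
    # On timeout the standby is demoted to async and the admin is notified.
    events.append((removal_timeout, "standby_demoted_admin_notified"))
    # The earliest event decides the outcome; ties keep the order above.
    return min(events, key=lambda event: event[0])


if __name__ == "__main__":
    five_minutes = 300
    # Scenario 1: no ack and no interrupt, so the timeout fires.
    print(wait_for_sync_ack(None, None, five_minutes, allow_cancel=True))
    # Scenario 2: cancel after 3 minutes, processed immediately.
    print(wait_for_sync_ack(None, 180, five_minutes, allow_cancel=True))
    # Scenario 2 with cancels disabled while waiting.
    print(wait_for_sync_ack(None, 180, five_minutes, allow_cancel=False))
```

With allow_cancel=False the only ways out of the wait are a standby
acknowledgment or the timeout, and the timeout path is the one where the
administrator is told.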