Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication - Mailing list pgsql-hackers
From | Bharath Rupireddy |
---|---|
Subject | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication |
Date | |
Msg-id | CALj2ACVW1b7ue2qskO-Mef6975Mf3QZJs+47sHAgk8QB-bmDMA@mail.gmail.com |
In response to | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication (SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>) |
Responses | Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication |
List | pgsql-hackers |
On Tue, Nov 29, 2022 at 10:45 PM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>
> On Tue, Nov 29, 2022 at 8:42 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>>
>> On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
>>>
>>> On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
>>> > 2. Process proc die immediately when a backend is waiting for sync
>>> > replication acknowledgement, as it does today, however, upon restart,
>>> > don't open up for business (don't accept ready-only connections)
>>> > unless the sync standbys have caught up.
>>> >
>>> > Are you planning to block connections or queries to the database? It would be
>>> > good to allow connections and let them query the monitoring views but block the
>>> > queries until sync standby have caught up. Otherwise, this leaves a monitoring
>>> > hole. In cloud, I presume superusers are allowed to connect and monitor (end
>>> > customers are not the role members and can't query the data). The same can't be
>>> > true for all the installations. Could you please add more details on your
>>> > approach?
>>>
>>> I think ALTER SYSTEM should be allowed, particularly so you can modify
>>> synchronous_standby_names, no?
>>
>> Yes, Change in synchronous_standby_names is expected in this situation. IMHO, blocking all the connections is not a recommended approach.
>
> How about allowing superusers (they can still read locally committed data) and users part of pg_monitor role?

I started to spend time on this feature again. Thanks all for your comments so far.

Per the latest comments, it looks like we're mostly okay to emit a warning and ignore query cancel interrupts while waiting for sync replication ACK. For proc die, it looks like the suggestion was to process it immediately and, upon the next restart, not allow user connections until all sync standbys have caught up.
However, we need to allow replication connections from standbys so that they can stream the needed WAL and catch up with the primary. We also need to allow superusers or users with the pg_monitor role to connect, to perform ALTER SYSTEM to remove any unresponsive sync standbys from the list, to disable sync replication altogether, or to monitor flush LSN and catch-up status. All other connections would be blocked. Note that replication, superuser, and pg_monitor connections are allowed only after the server reaches a consistent state, not before, so that they don't read any inconsistent data.

The trickiest part of the above is detecting, upon restart, that the server received proc die while waiting for sync replication ACK. One idea might be to set a flag in the control file before the crash. A second idea might be to write a marker file (although I don't favor this idea); its presence would indicate that the server was waiting for sync replication ACK before the crash. However, with either of these two ideas we may not detect every kind of crash in a backend that is waiting for sync replication ACK. Therefore, this may not be a complete solution.

A third idea might be to just let the primary wait for sync standbys to catch up upon restart, irrespective of whether or not it crashed while waiting for sync replication ACK. While this idea works well without having to detect all sorts of crashes, the primary may not come up if any unresponsive standbys are present (currently, the primary continues to be operational, for read-only queries at least, irrespective of whether sync standbys have caught up or not).

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
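
For readers following the thread, the connection rule proposed in the message above can be modelled roughly as below. This is only an illustrative sketch in Python, not PostgreSQL code and not part of any patch: the names `ServerState`, `ConnectionRequest`, and `admit` are hypothetical, and it assumes that ordinary connections are also refused before the server reaches a consistent state.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    # Hypothetical model of the restarted primary, not a PostgreSQL structure.
    reached_consistency: bool       # recovery has reached a consistent state
    sync_standbys_caught_up: bool   # all sync standbys have caught up


@dataclass
class ConnectionRequest:
    # Hypothetical description of an incoming connection.
    is_replication: bool = False    # walsender connection from a standby
    is_superuser: bool = False
    has_pg_monitor: bool = False    # member of the pg_monitor role


def admit(state: ServerState, conn: ConnectionRequest) -> bool:
    """Return True if the connection would be accepted under the proposed rule."""
    # Nobody connects before consistency, so no inconsistent data is read.
    if not state.reached_consistency:
        return False
    # Once the sync standbys have caught up, the server is open as usual.
    if state.sync_standbys_caught_up:
        return True
    # Until then, admit only standbys (to stream WAL) and privileged users
    # (to run ALTER SYSTEM or watch catch-up progress); block everyone else.
    return conn.is_replication or conn.is_superuser or conn.has_pg_monitor
```

Under this model, a standby or a pg_monitor member gets in while the primary is still waiting for its sync standbys, whereas an ordinary user is refused until they have caught up. The sketch deliberately leaves out how the restart detects that a backend was waiting for an ACK, which is the open question in the message.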