Thread: Synchronous replication
Hi,

The attached patch provides the core of the synchronous replication feature, based on streaming replication. I added this patch to CF 2010-07. The code is also available in my git repository:

    git://git.postgresql.org/git/users/fujii/postgres.git
    branch: synchrep

Synchronization levels
----------------------
The patch provides the replication_mode parameter in recovery.conf, which specifies the replication mode, i.e., how long transaction commit on the master server waits for replication before the command returns a "success" indication to the client. Valid modes are:

1. async
   doesn't make transaction commit wait for replication, i.e., asynchronous
   replication. This mode is already supported in 9.0.

2. recv
   makes transaction commit wait until the standby has received WAL records.

3. fsync
   makes transaction commit wait until the standby has received and flushed
   WAL records to disk.

4. replay
   makes transaction commit wait until the standby has replayed WAL records
   after receiving and flushing them to disk.

You can choose the synchronization level per standby.

Quorum commit
-------------
In previous discussion about synchronous replication, some people wanted a quorum commit feature. This feature is also included in Zoltan's synchronous replication patch, so I decided to implement it.

The patch provides the quorum parameter in postgresql.conf, which specifies how many standby servers transaction commit will wait for WAL records to be replicated to, before the command returns a "success" indication to the client. The default value is zero, which means transaction commit never waits for replication, regardless of replication_mode. Also, transaction commit never waits for replication to an asynchronous standby (i.e., one whose replication_mode is set to async), regardless of this parameter. If quorum is more than the number of synchronous standbys, transaction commit returns a "success" when the ACK has arrived from all of the synchronous standbys.

Currently the quorum parameter is defined as PGC_USERSET, so you can have some transactions replicate synchronously and others asynchronously.

Protocol
--------
I extended the handshake message "START_REPLICATION" so that it includes the replication_mode read from recovery.conf. If 'async' is passed, the master knows that it doesn't need to wait for the ACK from the standby.

I added an XLogRecPtr message, which is used to send the ACK indicating completion of replication from the walreceiver to the walsender. If replication_mode = 'async', this message is never sent. The XLogRecPtr message contains the current receive location if the mode is 'recv', the current flush location if the mode is 'fsync', and the current replay location if the mode is 'replay'. Then, if the location in the ACK is greater than or equal to the location of the COMMIT record, the transaction breaks out of the wait loop and returns a "success" to the client.

TODO
----
The patch has no features for performance improvement of synchronous replication. I admit that currently the performance overhead in the master is terrible. We need to address the following TODO items in the subsequent CF.

* Change the poll loop in the walsender
* Change the poll loop in the backend
* Change the poll loop in the startup process
* Change the poll loop in the walreceiver
* Perform the WAL write and replication concurrently
* Send WAL from not only disk but also WAL buffers

For the case where a network outage happens or the standby fails, we should expose the maximum time to wait for replication as a parameter. Furthermore, you might want to specify the reaction to the timeout. These are also not in the patch, so we need to address them in the subsequent CF, too.

In synchronous replication, it's important to check whether the standby is in sync with the master, but such a monitoring feature is also not in the patch. That's a TODO.

It would be difficult to commit the whole synchronous replication feature at one time, so I'm planning to develop it in stages.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
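P.S. For illustration only, a minimal two-server setup with this patch would look roughly like the snippet below. The connection string and the choice of 'fsync' are just examples; the new settings are quorum (postgresql.conf) and replication_mode (recovery.conf) as described above, and everything else is existing 9.0 configuration.

    # postgresql.conf on the master
    max_wal_senders = 1
    quorum = 1                      # commit waits for the ACK from one synchronous standby

    # recovery.conf on the standby
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com port=5432'
    replication_mode = 'fsync'      # async | recv | fsync | replay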
On Wed, Jul 14, 2010 at 2:50 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> The patch has no features for performance improvement of synchronous
> replication. I admit that currently the performance overhead in the
> master is terrible. We need to address the following TODO items in the
> subsequent CF.
>
> * Change the poll loop in the walsender
> * Change the poll loop in the backend
> * Change the poll loop in the startup process
> * Change the poll loop in the walreceiver
> * Perform the WAL write and replication concurrently
> * Send WAL from not only disk but also WAL buffers

I have a feeling that if we don't have a design for these last two before we start committing things, we're possibly going to regret it later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, Jul 15, 2010 at 12:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 14, 2010 at 2:50 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> The patch has no features for performance improvement of synchronous
>> replication. I admit that currently the performance overhead in the
>> master is terrible. We need to address the following TODO items in the
>> subsequent CF.
>>
>> * Change the poll loop in the walsender
>> * Change the poll loop in the backend
>> * Change the poll loop in the startup process
>> * Change the poll loop in the walreceiver
>> * Perform the WAL write and replication concurrently
>> * Send WAL from not only disk but also WAL buffers
>
> I have a feeling that if we don't have a design for these last two
> before we start committing things, we're possibly going to regret it
> later.

Yeah, I'll give it a try.

The problem is that the standby could apply WAL that has not yet been fsync'd on the master. So if we allow the walsender to send non-fsync'd WAL, we should make the walsender also send the current fsync location and prevent the standby from applying WAL newer than that fsync location. A new message type for sending the fsync location would be required in the Streaming Replication protocol, though sometimes it might go along with the XLogData message.

After the master crashes and walreceiver is terminated, the standby currently attempts to replay the WAL in pg_xlog and the archive. Since WAL in the archive is guaranteed to have already been fsync'd by the master, it's not a problem for the standby to apply that WAL. OTOH, WAL records in the pg_xlog directory might not exist on the crashed master. So we should always prevent the standby from applying any WAL in pg_xlog unless walreceiver is in progress. That is, if there is no WAL available in the archive, the standby ignores pg_xlog and starts the walreceiver process to request WAL streaming. This idea is a little inefficient because already-sent WAL might be sent again when the master is restarted. But since it ensures that the standby will not apply WAL that has not been fsync'd on the master, it's quite safe. What about this idea?

This idea doesn't conflict with the patch I submitted for CF 2010-07, so please feel free to review the patch :) But if you think that the patch is not reviewable until that idea has been implemented, I'll try to implement it ASAP.

PS. Probably I cannot reply to mail until July 21. Sorry.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 16/07/10 10:40, Fujii Masao wrote:
> So we should always prevent the standby from applying any WAL in pg_xlog
> unless walreceiver is in progress. That is, if there is no WAL available
> in the archive, the standby ignores pg_xlog and starts the walreceiver
> process to request WAL streaming.

That completely defeats the purpose of storing streamed WAL in pg_xlog in the first place. The reason it's written and fsync'd to pg_xlog is that if the standby subsequently crashes, you can use the WAL from pg_xlog to reapply the WAL up to minRecoveryPoint. Otherwise you can't start up the standby anymore.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On 16 July 2010, at 12:43, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 16/07/10 10:40, Fujii Masao wrote:
>> So we should always prevent the standby from applying any WAL in pg_xlog
>> unless walreceiver is in progress. That is, if there is no WAL available
>> in the archive, the standby ignores pg_xlog and starts the walreceiver
>> process to request WAL streaming.
>
> That completely defeats the purpose of storing streamed WAL in pg_xlog in the first place. The reason it's written and fsync'd to pg_xlog is that if the standby subsequently crashes, you can use the WAL from pg_xlog to reapply the WAL up to minRecoveryPoint. Otherwise you can't start up the standby anymore.
I guess we know for sure that this point has been fsync()ed on the Master, or that we could arrange it so that we know that?
On 16/07/10 20:26, Dimitri Fontaine wrote:
> On 16 July 2010, at 12:43, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> On 16/07/10 10:40, Fujii Masao wrote:
>>> So we should always prevent the standby from applying any WAL in pg_xlog
>>> unless walreceiver is in progress. That is, if there is no WAL available
>>> in the archive, the standby ignores pg_xlog and starts the walreceiver
>>> process to request WAL streaming.
>>
>> That completely defeats the purpose of storing streamed WAL in pg_xlog in the first place. The reason it's written and fsync'd to pg_xlog is that if the standby subsequently crashes, you can use the WAL from pg_xlog to reapply the WAL up to minRecoveryPoint. Otherwise you can't start up the standby anymore.
>
> I guess we know for sure that this point has been fsync()ed on the Master, or that we could arrange it so that we know that?

At the moment we only stream WAL that's already been fsync()ed on the master, so we don't have this problem, but Fujii is proposing to change that.

I think that's a premature optimization, and we should not try to change that. There is no evidence from the field (granted, streaming replication is a new feature) or from performance tests that it is a problem in practice, or that sending WAL earlier would help. Let's concentrate on the bare minimum required to make synchronous replication work.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On 14/07/10 09:50, Fujii Masao wrote:
> TODO
> ----
> The patch has no features for performance improvement of synchronous
> replication. I admit that currently the performance overhead in the
> master is terrible. We need to address the following TODO items in the
> subsequent CF.
>
> * Change the poll loop in the walsender
> * Change the poll loop in the backend
> * Change the poll loop in the startup process
> * Change the poll loop in the walreceiver

I was actually hoping to see a patch for these things first, before any of the synchronous replication stuff. Eliminating the polling loops is important, latency will be laughable otherwise, and it will help the synchronous case too.

> * Perform the WAL write and replication concurrently
> * Send WAL from not only disk but also WAL buffers

IMHO these are premature optimizations that we should not spend any effort on now. Maybe later, if ever.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On 14/07/10 09:50, Fujii Masao wrote:
> Quorum commit
> -------------
> In previous discussion about synchronous replication, some people
> wanted a quorum commit feature. This feature is also included in
> Zoltan's synchronous replication patch, so I decided to implement it.
>
> The patch provides the quorum parameter in postgresql.conf, which
> specifies how many standby servers transaction commit will wait for
> WAL records to be replicated to, before the command returns a
> "success" indication to the client. The default value is zero, which
> means transaction commit never waits for replication, regardless of
> replication_mode. Also, transaction commit never waits for replication
> to an asynchronous standby (i.e., one whose replication_mode is set to
> async), regardless of this parameter. If quorum is more than the
> number of synchronous standbys, transaction commit returns a "success"
> when the ACK has arrived from all of the synchronous standbys.

There should be a way to specify "wait for *all* connected standby servers to acknowledge"

> Protocol
> --------
> I extended the handshake message "START_REPLICATION" so that it
> includes the replication_mode read from recovery.conf. If 'async' is
> passed, the master knows that it doesn't need to wait for the ACK
> from the standby.

Please use self-explanatory names for the modes in the START_REPLICATION command, instead of just an integer.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Jul 16, 2010 at 7:43 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 16/07/10 10:40, Fujii Masao wrote:
>> So we should always prevent the standby from applying any WAL in pg_xlog
>> unless walreceiver is in progress. That is, if there is no WAL available
>> in the archive, the standby ignores pg_xlog and starts the walreceiver
>> process to request WAL streaming.
>
> That completely defeats the purpose of storing streamed WAL in pg_xlog in
> the first place. The reason it's written and fsync'd to pg_xlog is that if
> the standby subsequently crashes, you can use the WAL from pg_xlog to
> reapply the WAL up to minRecoveryPoint. Otherwise you can't start up the
> standby anymore.

But the standby can start up by reading the missing WAL files from the master, no?

On second thought, minRecoveryPoint can be guaranteed to be older than the fsync location on the master if we prevent the standby from applying WAL files beyond the fsync location. So we can safely apply the WAL files in pg_xlog up to minRecoveryPoint. Consequently, we should always prevent the standby from applying any WAL in pg_xlog newer than minRecoveryPoint unless walreceiver is in progress. Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, Jul 17, 2010 at 3:25 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 14/07/10 09:50, Fujii Masao wrote:
>> TODO
>> ----
>> The patch has no features for performance improvement of synchronous
>> replication. I admit that currently the performance overhead in the
>> master is terrible. We need to address the following TODO items in the
>> subsequent CF.
>>
>> * Change the poll loop in the walsender
>> * Change the poll loop in the backend
>> * Change the poll loop in the startup process
>> * Change the poll loop in the walreceiver
>
> I was actually hoping to see a patch for these things first, before any of
> the synchronous replication stuff. Eliminating the polling loops is
> important, latency will be laughable otherwise, and it will help the
> synchronous case too.

At first, note that the poll loops in the backend and walreceiver don't exist without the synchronous replication stuff.

Yeah, I'll start with the change of the poll loop in the walsender. I'm thinking that we should make the backend signal the walsender to send the outstanding WAL immediately, as the synchronous replication patch I submitted last year did. I use a signal here because the walsender needs to wait for the request from the backend and the ack message from the standby *concurrently* in synchronous replication. If we used a semaphore instead of a signal, the walsender would not be able to respond to the ack immediately, which also degrades performance.

The problem with this idea is that a signal can be sent per transaction commit. I'm not sure whether this frequent signaling really harms replication performance. BTW, when I benchmarked the previous synchronous replication patch based on this idea, AFAIR the result showed no impact from the signaling. But... thoughts? Do you have another, better idea?

>> * Perform the WAL write and replication concurrently
>> * Send WAL from not only disk but also WAL buffers
>
> IMHO these are premature optimizations that we should not spend any effort
> on now. Maybe later, if ever.

Yep!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, Jul 18, 2010 at 3:14 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 14/07/10 09:50, Fujii Masao wrote:
>> Quorum commit
>> -------------
>> In previous discussion about synchronous replication, some people
>> wanted a quorum commit feature. This feature is also included in
>> Zoltan's synchronous replication patch, so I decided to implement it.
>>
>> The patch provides the quorum parameter in postgresql.conf, which
>> specifies how many standby servers transaction commit will wait for
>> WAL records to be replicated to, before the command returns a
>> "success" indication to the client. The default value is zero, which
>> means transaction commit never waits for replication, regardless of
>> replication_mode. Also, transaction commit never waits for replication
>> to an asynchronous standby (i.e., one whose replication_mode is set to
>> async), regardless of this parameter. If quorum is more than the
>> number of synchronous standbys, transaction commit returns a "success"
>> when the ACK has arrived from all of the synchronous standbys.
>
> There should be a way to specify "wait for *all* connected standby servers
> to acknowledge"

Agreed. I'll allow -1 as the valid value of the quorum parameter, which means that transaction commit waits for all connected standbys.

>> Protocol
>> --------
>> I extended the handshake message "START_REPLICATION" so that it
>> includes the replication_mode read from recovery.conf. If 'async' is
>> passed, the master knows that it doesn't need to wait for the ACK
>> from the standby.
>
> Please use self-explanatory names for the modes in the START_REPLICATION
> command, instead of just an integer.

Agreed. What about changing the START_REPLICATION message to:

    START_REPLICATION XXX/XXX SYNC_LEVEL { async | recv | fsync | replay }

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
* Fujii Masao <masao.fujii@gmail.com> [100721 03:49]:
> >> The patch provides the quorum parameter in postgresql.conf, which
> >> specifies how many standby servers transaction commit will wait for
> >> WAL records to be replicated to, before the command returns a
> >> "success" indication to the client. The default value is zero, which
> >> means transaction commit never waits for replication, regardless of
> >> replication_mode. Also, transaction commit never waits for replication
> >> to an asynchronous standby (i.e., one whose replication_mode is set to
> >> async), regardless of this parameter. If quorum is more than the
> >> number of synchronous standbys, transaction commit returns a "success"
> >> when the ACK has arrived from all of the synchronous standbys.
> >
> > There should be a way to specify "wait for *all* connected standby servers
> > to acknowledge"
>
> Agreed. I'll allow -1 as the valid value of the quorum parameter, which
> means that transaction commit waits for all connected standbys.

Hm... so if my one synchronous standby is operating normally, and quorum is set to 1, I'll get what I want (commit waits until it's safely on both servers). But what happens if my standby goes bad? Suddenly the quorum setting is ignored (because it's greater than the number of connected standby servers?). Is there a way for me to not allow any commits if the quorum number of standbys is *not* available? Yes, I want my db to "halt" in that situation, and yes, alarm bells will be ringing...

In reality, I'm likely to run 2 synchronous slaves, with a quorum of 1. So one slave can fail and I can still have two going. But if that 2nd slave ever failed while the other was down, I definitely don't want the master to forge on ahead!

Of course, this won't be for everyone, just as the current "just connected standbys" behavior isn't for everything either...

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
On Wed, Jul 21, 2010 at 9:52 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> * Fujii Masao <masao.fujii@gmail.com> [100721 03:49]:
>> >> The patch provides the quorum parameter in postgresql.conf, which
>> >> specifies how many standby servers transaction commit will wait for
>> >> WAL records to be replicated to, before the command returns a
>> >> "success" indication to the client. The default value is zero, which
>> >> means transaction commit never waits for replication, regardless of
>> >> replication_mode. Also, transaction commit never waits for replication
>> >> to an asynchronous standby (i.e., one whose replication_mode is set to
>> >> async), regardless of this parameter. If quorum is more than the
>> >> number of synchronous standbys, transaction commit returns a "success"
>> >> when the ACK has arrived from all of the synchronous standbys.
>> >
>> > There should be a way to specify "wait for *all* connected standby servers
>> > to acknowledge"
>>
>> Agreed. I'll allow -1 as the valid value of the quorum parameter, which
>> means that transaction commit waits for all connected standbys.
>
> Hm... so if my one synchronous standby is operating normally, and quorum
> is set to 1, I'll get what I want (commit waits until it's safely on both
> servers). But what happens if my standby goes bad? Suddenly the quorum
> setting is ignored (because it's greater than the number of connected
> standby servers?). Is there a way for me to not allow any commits if the
> quorum number of standbys is *not* available? Yes, I want my db to
> "halt" in that situation, and yes, alarm bells will be ringing...
>
> In reality, I'm likely to run 2 synchronous slaves, with a quorum of 1.
> So one slave can fail and I can still have two going. But if that 2nd slave
> ever failed while the other was down, I definitely don't want the master
> to forge on ahead!
>
> Of course, this won't be for everyone, just as the current "just
> connected standbys" behavior isn't for everything either...

Yeah, we need to clear up the detailed design of the quorum commit feature, and reach consensus on it.

How should synchronous replication behave when the number of connected standby servers is less than quorum?

1. Ignore quorum. The current patch adopts this. If the ACKs from all
   connected standbys have arrived, transaction commit is successful
   even if the number of standbys is less than quorum. If there is no
   connected standby, transaction commit is always successful without
   regard to quorum.

2. Observe quorum. Aidan wants this. Until the number of connected
   standbys becomes greater than or equal to quorum, transaction commit
   waits.

Which is the right behavior for quorum commit? Or should we add a new parameter specifying the behavior of quorum commit?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Jul 21, 2010 at 4:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> There should be a way to specify "wait for *all* connected standby servers >> to acknowledge" > > Agreed. I'll allow -1 as the valid value of the quorum parameter, which > means that transaction commit waits for all connected standbys. Done. >> Please use self-explanatory names for the modes in START_REPLICATION >> command, instead of just an integer. > > Agreed. What about changing the START_REPLICATION message to?: > > START_REPLICATION XXX/XXX SYNC_LEVEL { async | recv | fsync | replay } Done. I attached the updated version of the patch. The code is also available in my git repository: git://git.postgresql.org/git/users/fujii/postgres.git branch: synchrep Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote:
> How should synchronous replication behave when the number of connected
> standby servers is less than quorum?
>
> 1. Ignore quorum. The current patch adopts this. If the ACKs from all
>    connected standbys have arrived, transaction commit is successful
>    even if the number of standbys is less than quorum. If there is no
>    connected standby, transaction commit is always successful without
>    regard to quorum.
>
> 2. Observe quorum. Aidan wants this. Until the number of connected
>    standbys becomes greater than or equal to quorum, transaction commit
>    waits.
>
> Which is the right behavior for quorum commit? Or should we add a new
> parameter specifying the behavior of quorum commit?

Initially I also expected the quorum to behave as described by Aidan/option 2.

Also, IMHO the name "quorum" is a bit short, like having "maximum" but not saying a max_something:

quorum_min_sync_standbys
quorum_max_sync_standbys

The question remains what the sync standbys are. Does it mean not-async? Intuitively, looking at the enumeration of replication_mode, I'd think that the sync standbys are all standbys that operate in a not-async mode. That would be clearer with a boolean sync (or not), and for sync standbys the replication_mode specified.

regards,
Yeb Havinga
On Thu, Jul 22, 2010 at 5:37 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Fujii Masao wrote:
>> How should synchronous replication behave when the number of connected
>> standby servers is less than quorum?
>>
>> 1. Ignore quorum. The current patch adopts this. If the ACKs from all
>>    connected standbys have arrived, transaction commit is successful
>>    even if the number of standbys is less than quorum. If there is no
>>    connected standby, transaction commit is always successful without
>>    regard to quorum.
>>
>> 2. Observe quorum. Aidan wants this. Until the number of connected
>>    standbys becomes greater than or equal to quorum, transaction commit
>>    waits.
>>
>> Which is the right behavior for quorum commit? Or should we add a new
>> parameter specifying the behavior of quorum commit?
>
> Initially I also expected the quorum to behave as described by
> Aidan/option 2.

OK. But some people (including me) would like to prevent the master from halting when the standby fails, so I think that 1. should also be supported. So I'm inclined to add a new parameter specifying the behavior of quorum commit when the number of synchronous standbys becomes less than quorum.

> Also, IMHO the name "quorum" is a bit short, like having
> "maximum" but not saying a max_something:
>
> quorum_min_sync_standbys
> quorum_max_sync_standbys

What about quorum_standbys?

> The question remains what the sync standbys are.

A sync standby is one which sets replication_mode to "recv", "fsync", or "replay".

> Intuitively, looking at the enumeration of replication_mode, I'd think that
> the sync standbys are all standbys that operate in a not-async mode. That
> would be clearer with a boolean sync (or not), and for sync standbys the
> replication_mode specified.

You mean that something like synchronous_replication should be added as a recovery.conf parameter in addition to replication_mode? Since increasing the number of similar parameters would confuse users, I'd rather not do that.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote: >> Intuitively by looking at the enumeration of replication_mode I'd think that >> the sync standbys are all standby's that operate in a not async mode. That >> would be clearer with a boolean sync (or not) and for sync standbys the >> replication_mode specified. >> > > You mean that something like synchronous_replication as the recovery.conf > parameter should be added in addition to replication_mode? Since increasing > the number of similar parameters would confuse users, I don't like do that. > I think what would be confusing if there is a mismatch between implemented concepts and parameters. 1 does the master wait for standby servers on commit? 2 how many acknowledgements must the master receive before it can continue? 3 is a standby server a synchronous one, i.e. does it acknowledge a commit? 4 when do standby servers acknowledge a commit? 5 does it only wait when the standby's are connected, or also when they are not connected? 6..? When trying to match parameter names for the concepts above: 1 - does not exist, but can be answered with quorum_standbys = 0 2 - quorum_standbys 3 - yes, if replication_mode != async (here is were I thought I had to think to much) 4 - replication modes recv, fsync and replay bot not async 5 - Zoltan's strict_sync_replication parameter Just an idea, what about for 4: acknowledge_commit = {no|recv|fsync|replay} then 3 = yes, if acknowledge_commit != no regards, Yeb Havinga
On Mon, Jul 26, 2010 at 5:27 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: > Fujii Masao wrote: >>> >>> Intuitively by looking at the enumeration of replication_mode I'd think >>> that >>> the sync standbys are all standby's that operate in a not async mode. >>> That >>> would be clearer with a boolean sync (or not) and for sync standbys the >>> replication_mode specified. >>> >> >> You mean that something like synchronous_replication as the recovery.conf >> parameter should be added in addition to replication_mode? Since >> increasing >> the number of similar parameters would confuse users, I don't like do >> that. >> > > I think what would be confusing if there is a mismatch between implemented > concepts and parameters. > > 1 does the master wait for standby servers on commit? > 2 how many acknowledgements must the master receive before it can continue? > 3 is a standby server a synchronous one, i.e. does it acknowledge a commit? > 4 when do standby servers acknowledge a commit? > 5 does it only wait when the standby's are connected, or also when they are > not connected? > 6..? > > When trying to match parameter names for the concepts above: > 1 - does not exist, but can be answered with quorum_standbys = 0 > 2 - quorum_standbys > 3 - yes, if replication_mode != async (here is were I thought I had to think > to much) > 4 - replication modes recv, fsync and replay bot not async > 5 - Zoltan's strict_sync_replication parameter > > Just an idea, what about > for 4: acknowledge_commit = {no|recv|fsync|replay} > then 3 = yes, if acknowledge_commit != no Thanks for the clarification. I still like replication_mode = {async|recv|fsync|replay} rather than synchronous_replication = {on|off} acknowledge_commit = {no|recv|fsync|replay} because the former is more intuitive for me and I don't want to increase the number of parameters. We need to hear from some users in this respect. If most want the latter, of course, I'd love to adopt it. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > I still like > > replication_mode = {async|recv|fsync|replay} > > rather than > > synchronous_replication = {on|off} > acknowledge_commit = {no|recv|fsync|replay} > Hello Fujii, I wasn't entirely clear. My suggestion was to have only acknowledge_commit = {no|recv|fsync|replay} instead of replication_mode = {async|recv|fsync|replay} regards, Yeb Havinga
On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: > Fujii Masao wrote: >> >> I still like >> >> replication_mode = {async|recv|fsync|replay} >> >> rather than >> >> synchronous_replication = {on|off} >> acknowledge_commit = {no|recv|fsync|replay} >> > > Hello Fujii, > > I wasn't entirely clear. My suggestion was to have only > > acknowledge_commit = {no|recv|fsync|replay} > > instead of > > replication_mode = {async|recv|fsync|replay} Okay, I'll change the patch accordingly. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 7/26/10 1:44 PM +0300, Fujii Masao wrote: > On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga<yebhavinga@gmail.com> wrote: >> I wasn't entirely clear. My suggestion was to have only >> >> acknowledge_commit = {no|recv|fsync|replay} >> >> instead of >> >> replication_mode = {async|recv|fsync|replay} > > Okay, I'll change the patch accordingly. For what it's worth, I think replication_mode is a lot clearer. Acknowledge_commit sounds like it would do something similar to asynchronous_commit. Regards, Marko Tiikkaja
On Mon, Jul 26, 2010 at 6:48 AM, Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> wrote: > On 7/26/10 1:44 PM +0300, Fujii Masao wrote: >> >> On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga<yebhavinga@gmail.com> wrote: >>> >>> I wasn't entirely clear. My suggestion was to have only >>> >>> acknowledge_commit = {no|recv|fsync|replay} >>> >>> instead of >>> >>> replication_mode = {async|recv|fsync|replay} >> >> Okay, I'll change the patch accordingly. > > For what it's worth, I think replication_mode is a lot clearer. > Acknowledge_commit sounds like it would do something similar to > asynchronous_commit. I agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Thu, Jul 22, 2010 at 10:37:12AM +0200, Yeb Havinga wrote: > Fujii Masao wrote: > Initially I also expected the quorum to behave like described by > Aidan/option 2. Also, IMHO the name "quorom" is a bit short, like having > "maximum" but not saying a max_something. > > quorum_min_sync_standbys > quorum_max_sync_standbys Perhaps I'm hijacking the wrong thread for this, but I wonder if the quorum idea is really the best thing for us. I've been thinking about Oracle's way of doing things[1]. In short, there are three different modes: availability, performance, and protection. "Protection" appears to mean that at least one standby has applied the log; "availability" means at least one standby has received the log info (it doesn't specify whether that info has been fsynced or applied, but presumably does not mean "applied", since it's distinct from "protection" mode); "performance" means replication is asynchronous. I'm not sure this method is perfect, but it might be simpler than the quorum behavior that has been considered, and adequate for actual use cases. [1] http://download.oracle.com/docs/cd/B28359_01/server.111/b28294/protection.htm#SBYDB02000 alternatively, http://is.gd/dLkq4 -- Joshua Tolley / eggyknap End Point Corporation http://www.endpoint.com
On Tue, Jul 27, 2010 at 12:36 PM, Joshua Tolley <eggyknap@gmail.com> wrote:
> Perhaps I'm hijacking the wrong thread for this, but I wonder if the quorum
> idea is really the best thing for us. I've been thinking about Oracle's way of
> doing things[1]. In short, there are three different modes: availability,
> performance, and protection. "Protection" appears to mean that at least one
> standby has applied the log; "availability" means at least one standby has
> received the log info (it doesn't specify whether that info has been fsynced
> or applied, but presumably does not mean "applied", since it's distinct from
> "protection" mode); "performance" means replication is asynchronous. I'm not
> sure this method is perfect, but it might be simpler than the quorum behavior
> that has been considered, and adequate for actual use cases.

In my case, I'd like to set up one synchronous standby on the near rack for high availability, and one asynchronous standby on a remote site for disaster recovery. Can Oracle's way cover that case?

"availability" mode with two standbys might create a sort of similar situation. That is, since the ACK from the near standby usually arrives first, the near standby acts as a synchronous standby and the remote one as an asynchronous one. But the ACK from the remote standby can arrive first, so it's not guaranteed that the near standby has received the log info before transaction commit returns a "success" to the client. In that case, we would have to fail over to the remote standby even if it's not under the control of a clusterware. This is a problem for me.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jul 26, 2010 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jul 26, 2010 at 6:48 AM, Marko Tiikkaja > <marko.tiikkaja@cs.helsinki.fi> wrote: >> On 7/26/10 1:44 PM +0300, Fujii Masao wrote: >>> >>> On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga<yebhavinga@gmail.com> wrote: >>>> >>>> I wasn't entirely clear. My suggestion was to have only >>>> >>>> acknowledge_commit = {no|recv|fsync|replay} >>>> >>>> instead of >>>> >>>> replication_mode = {async|recv|fsync|replay} >>> >>> Okay, I'll change the patch accordingly. >> >> For what it's worth, I think replication_mode is a lot clearer. >> Acknowledge_commit sounds like it would do something similar to >> asynchronous_commit. > > I agree. As the result of the vote, I'll leave the parameter "replication_mode" as it is. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jul 21, 2010 at 4:36 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I was actually hoping to see a patch for these things first, before any of
>> the synchronous replication stuff. Eliminating the polling loops is
>> important, latency will be laughable otherwise, and it will help the
>> synchronous case too.
>
> At first, note that the poll loops in the backend and walreceiver don't
> exist without the synchronous replication stuff.
>
> Yeah, I'll start with the change of the poll loop in the walsender. I'm
> thinking that we should make the backend signal the walsender to send the
> outstanding WAL immediately, as the synchronous replication patch I
> submitted last year did. I use a signal here because the walsender needs
> to wait for the request from the backend and the ack message from the
> standby *concurrently* in synchronous replication. If we used a semaphore
> instead of a signal, the walsender would not be able to respond to the ack
> immediately, which also degrades performance.
>
> The problem with this idea is that a signal can be sent per transaction
> commit. I'm not sure whether this frequent signaling really harms
> replication performance. BTW, when I benchmarked the previous synchronous
> replication patch based on this idea, AFAIR the result showed no impact
> from the signaling. But... thoughts? Do you have another, better idea?

The attached patch changes the backend so that it signals the walsender to wake up from its sleep and send WAL immediately. It doesn't include any other synchronous replication stuff.

The signal is sent right after a COMMIT, PREPARE TRANSACTION, COMMIT PREPARED or ABORT PREPARED record has been fsync'd. To suppress redundant signaling, I added a flag which indicates whether the walsender is ready for sending WAL up to the currently-fsync'd location. Only when the flag is false does the backend set it to true and send the signal to the walsender; when the flag is true, the signal doesn't need to be sent. The flag is set back to false right before the walsender sends WAL.

The code is also available in my git repository:

    git://git.postgresql.org/git/users/fujii/postgres.git
    branch: wakeup-walsnd

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
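P.S. To make the wakeup logic concrete, here is a rough sketch of the idea, not the actual patch code. The per-walsender flag is called sndrqst here, the other names mirror the existing walsender shared-memory structures, and details such as locking may differ in the real patch.

    /*
     * Sketch of the wakeup path: called by a backend right after a commit
     * or prepare record has been fsync'd.  The signal is suppressed when a
     * send request is already pending for that walsender.
     */
    void
    WalSndWakeup(void)
    {
        int     i;

        for (i = 0; i < max_wal_senders; i++)
        {
            /* use volatile pointer to prevent code rearrangement */
            volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
            pid_t   pid = 0;

            SpinLockAcquire(&walsnd->mutex);
            if (walsnd->pid != 0 && !walsnd->sndrqst)
            {
                walsnd->sndrqst = true;     /* request is now pending */
                pid = walsnd->pid;
            }
            SpinLockRelease(&walsnd->mutex);

            if (pid != 0)
                kill(pid, SIGUSR1);         /* wake the sleeping walsender */
        }
    }

    /*
     * In the walsender, XLogSend() clears the flag just before it reads and
     * sends WAL, so a request arriving during the send is not lost:
     *
     *     MyWalSnd->sndrqst = false;
     *     ... send WAL up to the current write/fsync location ...
     *     *caughtup = !MyWalSnd->sndrqst;
     */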
Fujii Masao wrote:
> On Mon, Jul 26, 2010 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Jul 26, 2010 at 6:48 AM, Marko Tiikkaja
>> <marko.tiikkaja@cs.helsinki.fi> wrote:
>>> On 7/26/10 1:44 PM +0300, Fujii Masao wrote:
>>>> On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:
>>>>> I wasn't entirely clear. My suggestion was to have only
>>>>>
>>>>> acknowledge_commit = {no|recv|fsync|replay}
>>>>>
>>>>> instead of
>>>>>
>>>>> replication_mode = {async|recv|fsync|replay}
>>>>
>>>> Okay, I'll change the patch accordingly.
>>>
>>> For what it's worth, I think replication_mode is a lot clearer.
>>> Acknowledge_commit sounds like it would do something similar to
>>> asynchronous_commit.
>>
>> I agree.
>
> As the result of the vote, I'll leave the parameter "replication_mode"
> as it is.

I'd like to bring forward another suggestion (please tell me when it is becoming spam). My feeling about replication_mode as it stands is that the same parameter says something about async vs. sync, as well as, if sync, which method of feedback to the master is used. OTOH, having two parameters would need documentation saying that the feedback method may only be set if the replication mode is sync, as well as checks. So it is actually good to have it all in one parameter.

But somehow the shoe pinches, because async feels different from the other three values. There is a way to move async out of the enumeration:

synchronous_replication_mode = off | recv | fsync | replay

This also looks a bit like the "synchronous_replication = N # similar in name to synchronous_commit" that Simon Riggs proposed in http://archives.postgresql.org/pgsql-hackers/2010-05/msg01418.php

regards,
Yeb Havinga

PS: Please bear with me; I thought a bit about the deduction users must make when figuring out whether the replication mode is synchronous. That question might be important when counting 'which servers are the synchronous standbys' to debug quorum settings.

replication_mode: from the assumptions !async -> sync and !async -> recv|fsync|replay, infer recv|fsync|replay -> synchronous replication.

synchronous_replication_mode: from the assumptions !off -> on and !off -> recv|fsync|replay, infer recv|fsync|replay -> synchronous replication.

I think the last deduction is more easily made by humans, since everybody will make the !off -> on assumption, but not the !async -> sync one without having verified it in the documentation.
Joshua Tolley wrote: > Perhaps I'm hijacking the wrong thread for this, but I wonder if the quorum > idea is really the best thing for us. For reference: it appeared in a long thread a while ago http://archives.postgresql.org/pgsql-hackers/2010-05/msg01226.php. > In short, there are three different modes: availability, > performance, and protection. "Protection" appears to mean that at least one > standby has applied the log; "availability" means at least one standby has > received the log info > Maybe we could do both, by describing use cases along the availability, performance and protection setups in the documentation and how they would be reflected with the standby related parameters. regards, Yeb Havinga
Fujii Masao wrote:
> The attached patch changes the backend so that it signals the walsender to
> wake up from its sleep and send WAL immediately. It doesn't include any
> other synchronous replication stuff.

Hello Fujii,

I noted the changes in XLogSend where, instead of *caughtup = true/false, it now returns !MyWalSnd->sndrqst. That value is initialized to false in that procedure and it cannot be changed to true during execution of that procedure, or can it?

regards,
Yeb Havinga
On Tue, Jul 27, 2010 at 7:39 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: > Fujii Masao wrote: >> >> The attached patch changes the backend so that it signals walsender to >> wake up from the sleep and send WAL immediately. It doesn't include any >> other synchronous replication stuff. >> > > Hello Fujii, Thanks for the review! > I noted the changes in XlogSend where instead of *caughtup = true/false it > now returns !MyWalSnd->sndrqst. That value is initialized to false in that > procedure and it cannot be changed to true during execution of that > procedure, or can it? That value is set to true in WalSndWakeup(). If WalSndWakeup() is called after initialization of that value in XLogSend(), *caughtup is set to false. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jul 27, 2010 at 5:42 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: > I'd like to bring forward another suggestion (please tell me when it is > becoming spam). My feeling about replication_mode as is, is that is says in > the same parameter something about async or sync, as well as, if sync, which > method of feedback to the master. OTOH having two parameters would need > documentation that the feedback method may only be set if the > replication_mode was sync, as well as checks. So it is actually good to have > it all in one parameter > > But somehow the shoe pinches, because async feels different from the other > three parameters. There is a way to move async out of the enumeration: > > synchronous_replication_mode = off | recv | fsync | replay ISTM that we need to get more feedback from users to determine which is the best. So, how about leaving the parameter as it is and revisiting this topic later? Since it's not difficult to change the parameter later, we will not regret even if we delay that determination. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote:
>> I noted the changes in XLogSend where, instead of *caughtup = true/false, it
>> now returns !MyWalSnd->sndrqst. That value is initialized to false in that
>> procedure and it cannot be changed to true during execution of that
>> procedure, or can it?
>
> That value is set to true in WalSndWakeup(). If WalSndWakeup() is called
> after initialization of that value in XLogSend(), *caughtup is set to false.

Ah, so it can be changed by another backend process.

Another question: is there a reason not to send the signal in XLogFlush itself, so it would be called at

CreateCheckPoint(), EndPrepare(), FlushBuffer(), RecordTransactionAbortPrepared(), RecordTransactionCommit(), RecordTransactionCommitPrepared(), RelationTruncate(), SlruPhysicalWritePage(), write_relmap_file(), WriteTruncateXlogRec(), and xact_redo_commit()?

regards,
Yeb Havinga
On Tue, Jul 27, 2010 at 01:41:10PM +0900, Fujii Masao wrote: > On Tue, Jul 27, 2010 at 12:36 PM, Joshua Tolley <eggyknap@gmail.com> wrote: > > Perhaps I'm hijacking the wrong thread for this, but I wonder if the quorum > > idea is really the best thing for us. I've been thinking about Oracle's way of > > doing things[1]. In short, there are three different modes: availability, > > performance, and protection. "Protection" appears to mean that at least one > > standby has applied the log; "availability" means at least one standby has > > received the log info (it doesn't specify whether that info has been fsynced > > or applied, but presumably does not mean "applied", since it's distinct from > > "protection" mode); "performance" means replication is asynchronous. I'm not > > sure this method is perfect, but it might be simpler than the quorum behavior > > that has been considered, and adequate for actual use cases. > > In my case, I'd like to set up one synchronous standby on the near rack for > high-availability, and one asynchronous standby on the remote site for disaster > recovery. Can Oracle's way cover the case? I don't think it can support the case you're interested in, though I'm not terribly expert on it. I'm definitely not arguing for the syntax Oracle uses, or something similar; I much prefer the flexibility we're proposing, and agree with Yeb Havinga in another email who suggests we spell out in documentation some recipes for achieving various possible scenarios given whatever GUCs we settle on. > "availability" mode with two standbys might create a sort of similar situation. > That is, since the ACK from the near standby arrives in first, the near standby > acts synchronous and the remote one does asynchronous. But the ACK from the > remote standby can arrive in first, so it's not guaranteed that the near standby > has received the log info before transaction commit returns a "success" to the > client. In this case, we have to failover to the remote standby even if it's not > under control of a clusterware. This is a problem for me. My concern is that in a quorum system, if the quorum number is less than the total number of replicas, there's no way to know *which* replicas composed the quorum for any given transaction, so we can't know which servers to fail to if the master dies. This isn't different from Oracle, where it looks like essentially the "quorum" value is always 1. Your scenario shows that all replicas are not created equal, and that sometimes we'll be interested in WAL getting committed on a specific subset of the available servers. If I had two nearby replicas called X and Y, and one at a remote site called Z, for instance, I'd set quorum to 2, but really I'd want to say "wait for server X and Y before committing, but don't worry about Z". I have no idea how to set up our GUCs to encode a situation like that :) -- Joshua Tolley / eggyknap End Point Corporation http://www.endpoint.com
On Tue, Jul 27, 2010 at 8:48 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Is there a reason not to send the signal in XLogFlush itself, so it would be
> called at
>
> CreateCheckPoint(), EndPrepare(), FlushBuffer(),
> RecordTransactionAbortPrepared(), RecordTransactionCommit(),
> RecordTransactionCommitPrepared(), RelationTruncate(),
> SlruPhysicalWritePage(), write_relmap_file(), WriteTruncateXlogRec(), and
> xact_redo_commit()?

Yes. That's because there is no need to send WAL immediately from any functions other than the following:

* EndPrepare()
* RecordTransactionAbortPrepared()
* RecordTransactionCommit()
* RecordTransactionCommitPrepared()

The other functions call XLogFlush() to follow the basic WAL rule: in the standby, WAL records are always flushed to disk prior to any corresponding data-file change. So we don't need to replicate the result of those XLogFlush() calls immediately for the WAL rule.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
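P.S. Concretely, in each of those four functions the wakeup call would go right after the XLogFlush() of the commit/prepare record, e.g. in RecordTransactionCommit() roughly like this (simplified; the guard condition is only an example):

    /* flush the commit record, then tell the walsenders that new WAL is fsync'd */
    XLogFlush(XactLastRecEnd);

    if (max_wal_senders > 0)
        WalSndWakeup();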
On Tue, Jul 27, 2010 at 10:12 PM, Joshua Tolley <eggyknap@gmail.com> wrote: > I don't think it can support the case you're interested in, though I'm not > terribly expert on it. I'm definitely not arguing for the syntax Oracle uses, > or something similar; I much prefer the flexibility we're proposing, and agree > with Yeb Havinga in another email who suggests we spell out in documentation > some recipes for achieving various possible scenarios given whatever GUCs we > settle on. Agreed. I'll add it to my TODO list. > My concern is that in a quorum system, if the quorum number is less than the > total number of replicas, there's no way to know *which* replicas composed the > quorum for any given transaction, so we can't know which servers to fail to if > the master dies. What about checking the current WAL receive location of each standby by using pg_last_xlog_receive_location()? The standby which has the newest location should be failed over to. > This isn't different from Oracle, where it looks like > essentially the "quorum" value is always 1. Your scenario shows that all > replicas are not created equal, and that sometimes we'll be interested in WAL > getting committed on a specific subset of the available servers. If I had two > nearby replicas called X and Y, and one at a remote site called Z, for > instance, I'd set quorum to 2, but really I'd want to say "wait for server X > and Y before committing, but don't worry about Z". > > I have no idea how to set up our GUCs to encode a situation like that :) Yeah, quorum commit alone cannot cover that situation. I think that current approach (i.e., quorum commit plus replication mode per standby) would cover that. In your example, you can choose "recv", "fsync" or "replay" as replication_mode in X and Y, and choose "async" in Z. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jul 27, 2010 at 10:53:45PM +0900, Fujii Masao wrote: > On Tue, Jul 27, 2010 at 10:12 PM, Joshua Tolley <eggyknap@gmail.com> wrote: > > My concern is that in a quorum system, if the quorum number is less than the > > total number of replicas, there's no way to know *which* replicas composed the > > quorum for any given transaction, so we can't know which servers to fail to if > > the master dies. > > What about checking the current WAL receive location of each standby by > using pg_last_xlog_receive_location()? The standby which has the newest > location should be failed over to. That makes sense. Thanks. > > This isn't different from Oracle, where it looks like > > essentially the "quorum" value is always 1. Your scenario shows that all > > replicas are not created equal, and that sometimes we'll be interested in WAL > > getting committed on a specific subset of the available servers. If I had two > > nearby replicas called X and Y, and one at a remote site called Z, for > > instance, I'd set quorum to 2, but really I'd want to say "wait for server X > > and Y before committing, but don't worry about Z". > > > > I have no idea how to set up our GUCs to encode a situation like that :) > > Yeah, quorum commit alone cannot cover that situation. I think that > current approach (i.e., quorum commit plus replication mode per standby) > would cover that. In your example, you can choose "recv", "fsync" or > "replay" as replication_mode in X and Y, and choose "async" in Z. Clearly I need to read through the GUCs and docs better. I'll try to keep quiet until that's finished :) -- Joshua Tolley / eggyknap End Point Corporation http://www.endpoint.com
On 27 July 2010, at 15:12, Joshua Tolley <eggyknap@gmail.com> wrote:
> My concern is that in a quorum system, if the quorum number is less than the
> total number of replicas, there's no way to know *which* replicas composed the
> quorum for any given transaction, so we can't know which servers to fail to if
> the master dies. This isn't different from Oracle, where it looks like
> essentially the "quorum" value is always 1. Your scenario shows that all
> replicas are not created equal, and that sometimes we'll be interested in WAL
> getting committed on a specific subset of the available servers. If I had two
> nearby replicas called X and Y, and one at a remote site called Z, for
> instance, I'd set quorum to 2, but really I'd want to say "wait for server X
> and Y before committing, but don't worry about Z".
>
> I have no idea how to set up our GUCs to encode a situation like that :)

You make it so that Z does not take a vote, by setting it async.

Regards,
--
dim
On 27/07/10 16:12, Joshua Tolley wrote: > My concern is that in a quorum system, if the quorum number is less than the > total number of replicas, there's no way to know *which* replicas composed the > quorum for any given transaction, so we can't know which servers to fail to if > the master dies. In fact, it's possible for one standby to sync up to X, then disconnect and reconnect, and have the master count it second time in the quorum. Especially if the master doesn't notice that the standby disconnected, e.g a network problem. I don't think any of this quorum stuff makes much sense without explicitly registering standbys in the master. That would also solve the fuzziness with wal_keep_segments - if the master knew what standbys exist, it could keep track of how far each standby has received WAL, and keep just enough WAL for each standby to catch up. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, Aug 1, 2010 at 7:11 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > In fact, it's possible for one standby to sync up to X, then disconnect and > reconnect, and have the master count it second time in the quorum. > Especially if the master doesn't notice that the standby disconnected, e.g a > network problem. > > I don't think any of this quorum stuff makes much sense without explicitly > registering standbys in the master. This doesn't have to be done manually. The streaming protocol could include the standby sending its system id to the master. The master could just keep a list of system ids with the last record they've been sent and the last they've confirmed receipt, fsync, application, whatever the protocol covers. If the same system reconnects it just overwrites the existing data for that system id. -- greg
On Sun, Aug 1, 2010 at 8:30 AM, Greg Stark <gsstark@mit.edu> wrote:
> On Sun, Aug 1, 2010 at 7:11 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> In fact, it's possible for one standby to sync up to X, then disconnect and
>> reconnect, and have the master count it second time in the quorum.
>> Especially if the master doesn't notice that the standby disconnected, e.g a
>> network problem.
>>
>> I don't think any of this quorum stuff makes much sense without explicitly
>> registering standbys in the master.
>
> This doesn't have to be done manually. The streaming protocol could
> include the standby sending its system id to the master. The master
> could just keep a list of system ids with the last record they've been
> sent and the last they've confirmed receipt, fsync, application,
> whatever the protocol covers. If the same system reconnects it just
> overwrites the existing data for that system id.

That seems entirely too clever. Where are you going to store this data? What if you want to clean out the list?

I've felt from the beginning that the idea of doing synchronous replication without having an explicit notion of what standbys are out there was not on very sound footing, and I think the difficulties of making quorum commit work properly are only further evidence of that. Much has been made of the notion of "wait for N votes, but allow standbys to explicitly give up their vote", but that's still not fully general - for example, you can't implement A && (B || C). Perhaps someone will claim that nobody wants to do that anyway (which I don't believe, BTW), but even in simpler cases it would be nicer to have an explicit policy rather than - in effect - inferring a policy from a soup of GUC settings.

For example, suppose you want one synchronous standby (A) and two asynchronous standbys (B and C). You can say quorum=1 on the master and then configure vote=1 on A and vote=0 on B and C, but now you have to look at four machines to figure out what the policy is, and a change on any one of those machines can break it. ISTM that if you can just write synchronous_standbys=A on the master, that's a whole lot more clear and less error-prone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Sun, Aug 1, 2010 at 9:30 PM, Greg Stark <gsstark@mit.edu> wrote: > This doesn't have to be done manually. Agreed, if we register standbys in the master. > The streaming protocol could > include the standby sending its system id to the master. The master > could just keep a list of system ids with the last record they've been > sent and the last they've confirmed receipt, fsync, application, > whatever the protocol covers. If the same system reconnects it just > overwrites the existing data for that system id. Since every standby has the same system id, we cannot distinguish them by that id. ISTM that the master should assign a unique id to each standby, and each standby should save it in pg_control. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sun, Aug 1, 2010 at 10:08 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sun, Aug 1, 2010 at 9:30 PM, Greg Stark <gsstark@mit.edu> wrote: >> This doesn't have to be done manually. > > Agreed, if we register standbys in the master. > >> The streaming protocol could >> include the standby sending its system id to the master. The master >> could just keep a list of system ids with the last record they've been >> sent and the last they've confirmed receipt, fsync, application, >> whatever the protocol covers. If the same system reconnects it just >> overwrites the existing data for that system id. > > Since every standby has the same system id, we cannot distinguish > them by that id. ISTM that the master should assign the unique id > for each standby, and they should save it in pg_control. Another option might be to let the user name them. standby_name='near' standby_name='far1' standby_name='far2' ...or whatever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
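To make the bookkeeping being discussed here concrete: a minimal sketch of the per-standby state the master could keep once standbys are identifiable by a user-assigned name. The struct and field names are hypothetical, not taken from the patch.

#include "postgres.h"
#include "access/xlogdefs.h"            /* XLogRecPtr */

#define MAX_STANDBY_NAME 64

/*
 * Hypothetical per-standby entry on the master.  One entry per known
 * standby; a reconnecting standby overwrites its own entry instead of
 * being counted a second time towards the quorum.
 */
typedef struct StandbyEntry
{
    char        name[MAX_STANDBY_NAME]; /* user-assigned, e.g. 'near' */
    bool        connected;              /* walreceiver currently attached? */
    XLogRecPtr  sent;                   /* last WAL position sent */
    XLogRecPtr  received;               /* last position ACKed as received */
    XLogRecPtr  flushed;                /* last position ACKed as fsynced */
    XLogRecPtr  applied;                /* last position ACKed as replayed */
} StandbyEntry;

With something like this, the master could both count quorum ACKs against distinct entries and compute the oldest WAL position any registered standby still needs.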
On Sun, Aug 1, 2010 at 3:11 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I don't think any of this quorum stuff makes much sense without explicitly > registering standbys in the master. I'm not sure if this is a good idea. This requires users to do more manual work when setting up replication: assign a unique name (or ID) to each standby, register each one in the master, specify the name in each standby's recovery.conf (or elsewhere), and remove the registration from the master when getting rid of a standby. But this is similar to how MySQL replication is set up, so some people (excluding me) may be familiar with it. > That would also solve the fuzziness with wal_keep_segments - if the master > knew what standbys exist, it could keep track of how far each standby has > received WAL, and keep just enough WAL for each standby to catch up. What if a registered standby stays down for a long time? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sun, Aug 1, 2010 at 9:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Perhaps someone will claim that nobody wants to do that anyway (which > I don't believe, BTW), but even in simpler cases it would be nicer to > have an explicit policy rather than - in effect - inferring a policy > from a soup of GUC settings. For example, if you want one synchronous > standby (A) and two asynchronous standbys (B and C). You can say > quorum=1 on the master and then configure vote=1 on A and vote=0 on B > and C, but now you have to look at four machines to figure out what > the policy is, and a change on any one of those machines can break it. > ISTM that if you can just write synchronous_standbys=A on the master, > that's a whole lot more clear and less error-prone. Some standbys may later become the master after a failover. So we would need to write something like synchronous_standbys=A not only on the current master but also on those standbys. Changing synchronous_standbys would require a change on all those servers. Or should the master replicate even that change to the standbys? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Aug 2, 2010 at 5:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sun, Aug 1, 2010 at 9:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Perhaps someone will claim that nobody wants to do that anyway (which >> I don't believe, BTW), but even in simpler cases it would be nicer to >> have an explicit policy rather than - in effect - inferring a policy >> from a soup of GUC settings. For example, if you want one synchronous >> standby (A) and two asynchronous standbys (B and C). You can say >> quorum=1 on the master and then configure vote=1 on A and vote=0 on B >> and C, but now you have to look at four machines to figure out what >> the policy is, and a change on any one of those machines can break it. >> ISTM that if you can just write synchronous_standbys=A on the master, >> that's a whole lot more clear and less error-prone. > > Some standbys may become master later by failover. So we would > need to write something like synchronous_standbys=A on not only > current one master but also those standbys. Changing > synchronous_standbys would require change on all those servers. > Or the master should replicate even that change to the standbys? Let's not get *the manner of specifying the policy* confused with *the need to update the policy when the master changes*. It doesn't seem likely you would want the same value for synchronous_standbys on all your machines. In the most common configuration, you'd probably have: on A: synchronous_standbys=B on B: synchronous_standbys=A -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Mon, Aug 2, 2010 at 7:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Let's not get *the manner of specifying the policy* confused with *the > need to update the policy when the master changes*. It doesn't seem > likely you would want the same value for synchronous_standbys on all > your machines. In the most common configuration, you'd probably have: > > on A: synchronous_standbys=B > on B: synchronous_standbys=A Oh, true. But what if we have another synchronous standby called C? Would we specify the policy as follows? on A: synchronous_standbys=B,C on B: synchronous_standbys=A,C on C: synchronous_standbys=A,B We would need to change the setting on both A and B when we want to change the name of the third standby from C to D, for example. No? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Aug 2, 2010 at 7:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Aug 2, 2010 at 7:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Let's not get *the manner of specifying the policy* confused with *the >> need to update the policy when the master changes*. It doesn't seem >> likely you would want the same value for synchronous_standbys on all >> your machines. In the most common configuration, you'd probably have: >> >> on A: synchronous_standbys=B >> on B: synchronous_standbys=A > > Oh, true. But, what if we have another synchronous standby called C? > We specify the policy as follows?: > > on A: synchronous_standbys=B,C > on B: synchronous_standbys=A,C > on C: synchronous_standbys=A,B > > We would need to change the setting on both A and B when we want to > change the name of the third standby from C to D, for example. No? Sure. If you give the standbys names, then if people change the names, they'll have to update their configuration. But I can't see that as an argument against doing it. You can remove the possibility that someone will have a hassle if they rename a server by not allowing them to give it a name in the first place, but that doesn't seem like a win from a usability perspective. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Mon, Aug 2, 2010 at 8:32 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Sure. If you give the standbys names, then if people change the > names, they'll have to update their configuration. But I can't see > that as an argument against doing it. You can remove the possibility > that someone will have a hassle if they rename a server by not > allowing them to give it a name in the first place, but that doesn't > seem like a win from a usability perspective. I'm just comparing your idea (i.e., set synchronous_standbys on each possible master) with my idea (i.e., set replication_mode on each standby). Though your idea has the advantage described in the following post, it seems to make the setup of the standbys more complicated, as I described. So I'm trying to come up with a better idea. http://archives.postgresql.org/pgsql-hackers/2010-08/msg00007.php Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Mon, Aug 2, 2010 at 7:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> Let's not get *the manner of specifying the policy* confused with *the >> need to update the policy when the master changes*. It doesn't seem >> likely you would want the same value for synchronous_standbys on all >> your machines. In the most common configuration, you'd probably have: >> >> on A: synchronous_standbys=B >> on B: synchronous_standbys=A >> > > Oh, true. But, what if we have another synchronous standby called C? > We specify the policy as follows?: > > on A: synchronous_standbys=B,C > on B: synchronous_standbys=A,C > on C: synchronous_standbys=A,B > > We would need to change the setting on both A and B when we want to > change the name of the third standby from C to D, for example. No? > What if the master is named as well in the 'pool of servers that are in sync'? In the scenario above this pool would be A,B,C. Working with this concept has the benefit that the setting can be copied to all other servers as well, and is invariant under any number of failures or switchovers. The same could also hold for quorum expressions like A && (B || C), if A,B,C are either master or standby. I initially thought that once the definitions could be the same on all servers, having them in a system catalog would be a good thing. However that'd probably be hard to set up, and also in the case of failures while changing the parameters it could become very messy. regards, Yeb Havinga
On Mon, Aug 2, 2010 at 8:57 AM, Yeb Havinga <yebhavinga@gmail.com> wrote: > Fujii Masao wrote: >> >> On Mon, Aug 2, 2010 at 7:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >>> >>> Let's not get *the manner of specifying the policy* confused with *the >>> need to update the policy when the master changes*. It doesn't seem >>> likely you would want the same value for synchronous_standbys on all >>> your machines. In the most common configuration, you'd probably have: >>> >>> on A: synchronous_standbys=B >>> on B: synchronous_standbys=A >>> >> >> Oh, true. But, what if we have another synchronous standby called C? >> We specify the policy as follows?: >> >> on A: synchronous_standbys=B,C >> on B: synchronous_standbys=A,C >> on C: synchronous_standbys=A,B >> >> We would need to change the setting on both A and B when we want to >> change the name of the third standby from C to D, for example. No? >> > > What if the master is named as well in the 'pool of servers that are in > sync'? In the scenario above this pool would be A,B,C. Working with this > concept has as benefit that the setting can be copied to all other servers > as well, and is invariant under any number of failures or switchovers. The > same could also hold for quorum expressions like A && (B || C), if A,B,C are > either master or standby. > > I initially though that once the definitions could be the same on all > servers, having them in a system catalog would be a good thing. However > that'd propably hard to setup, and also in the case of failures during > change of the parameters it could become very messy. Yeah, I think this information has to be stored either in GUCs or in a flat-file somewhere. Putting it in a system catalog will cause major problems when trying to get a down system back up, I think. I suspect that for complex setups, people will need to use some kind of cluster-ware to update the settings as nodes go up and down. But I think it will still be simpler if the nodes are named. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 27/07/10 13:29, Fujii Masao wrote: > On Tue, Jul 27, 2010 at 7:39 PM, Yeb Havinga<yebhavinga@gmail.com> wrote: >> Fujii Masao wrote: >> I noted the changes in XlogSend where instead of *caughtup = true/false it >> now returns !MyWalSnd->sndrqst. That value is initialized to false in that >> procedure and it cannot be changed to true during execution of that >> procedure, or can it? > > That value is set to true in WalSndWakeup(). If WalSndWakeup() is called > after initialization of that value in XLogSend(), *caughtup is set to false. There are some race conditions with the signaling. If another process finishes XLOG flush and sends the signal when a walsender has just finished one iteration of its main loop, walsender will reset xlogsend_requested and go to sleep. It should not sleep but send the pending WAL immediately. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 02/08/10 11:45, Fujii Masao wrote: > On Sun, Aug 1, 2010 at 3:11 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> I don't think any of this quorum stuff makes much sense without explicitly >> registering standbys in the master. > > I'm not sure if this is a good idea. This requires users to do more > manual operations than ever when setting up the replication; assign > unique name (or ID) to each standby, register them in the master, > specify the names in each recovery.conf (or elsewhere), and remove > the registration from the master when getting rid of the standby. > > But this is similar to the way of MySQL replication setup, so some > people (excluding me) may be familiar with it. > >> That would also solve the fuzziness with wal_keep_segments - if the master >> knew what standbys exist, it could keep track of how far each standby has >> received WAL, and keep just enough WAL for each standby to catch up. > > What if the registered standby stays down for a long time? Then you risk running out of disk space, similar to having an archive command that fails for some reason. That's one reason the registration should not be too automatic - there are serious repercussions if the standby just disappears. If the standby is a synchronous one, the master will stop committing or delay acknowledging commits, depending on the configuration, and the master needs to keep extra WAL around. Of course, we can still support unregistered standbys, with the current semantics. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Aug 4, 2010 at 12:35 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > There's some race conditions with the signaling. If another process finishes > XLOG flush and sends the signal when a walsender has just finished one > iteration of its main loop, walsender will reset xlogsend_requested and go > to sleep. It should not sleep but send the pending WAL immediately. Yep. To avoid that race condition, xlogsend_requested should be reset to false after sleep and before calling XLogSend(). I attached the updated version of the patch. Of course, the code is also available in my git repository: git://git.postgresql.org/git/users/fujii/postgres.git branch: wakeup-walsnd Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
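Roughly, the loop ordering being described looks like the sketch below. The names (xlogsend_requested, WalSndDelay, pg_usleep, XLogSend) follow the thread, but the function is a hypothetical, simplified fragment of a walsender-style loop, not the patch itself; XLogSend is shown with a simplified signature that returns false when the connection is closed.

static volatile sig_atomic_t xlogsend_requested = false;   /* set by the wakeup
                                                             * signal handler */

/* Simplified sketch of the main loop with the fixed ordering. */
static void
WalSndLoopSketch(void)
{
    for (;;)
    {
        bool        caughtup = false;

        /* Sleep only if no send has been requested since the last iteration. */
        if (!xlogsend_requested)
            pg_usleep(WalSndDelay * 1000L); /* a signal arriving just before
                                             * this sleep is still missed; see
                                             * the follow-up messages */

        /*
         * Clear the request after sleeping and before reading WAL, so a
         * request that arrives while we are sending is noticed on the next
         * iteration instead of being lost.
         */
        xlogsend_requested = false;

        if (!XLogSend(&caughtup))
            break;              /* connection closed */

        if (!caughtup)
            continue;           /* more WAL pending; send it without sleeping */
    }
}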
On Wed, Aug 4, 2010 at 10:38 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Then you risk running out of disk space. Similar to having an archive > command that fails for some reason. > > That's one reason the registration should not be too automatic - there is > serious repercussions if the standby just disappears. If the standby is a > synchronous one, the master will stop committing or delay acknowledging > commits, depending on the configuration, and the master needs to keep extra > WAL around. Umm... in addition to registration of each standby, I think we should allow users to set an upper limit on the number of WAL files kept in pg_xlog to avoid running out of disk space. If that limit is exceeded, the master disconnects standbys that have fallen too far behind and removes all the WAL files not required by the currently connected standbys. If you don't want any standby to disappear unexpectedly because of the upper limit, you can set it to 0 (= no limit). I'm thinking of making users register and unregister each standby via SQL functions like register_standby() and unregister_standby(): void register_standby(standby_name text, streaming_start_lsn text) void unregister_standby(standby_name text) Note that standby_name should be specified in recovery.conf of each standby. By using them we can easily specify which WAL files must not be removed for a new standby while taking the base backup for it, as follows: SELECT register_standby('foo', pg_start_backup('foo')) Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
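The retention rule proposed above could look roughly like the sketch below. The helper name is hypothetical, and a single flat segment number is used for brevity (real WAL segments are identified by a log/segment pair), so this is only an illustration of the rule, not proposed code.

#include "postgres.h"

/*
 * Hypothetical helper: given the newest segment, the oldest segment still
 * needed by any registered standby, and an upper limit on segments kept in
 * pg_xlog (0 = no limit), return the oldest segment to retain.  Standbys
 * that need anything older than the result would be disconnected and have
 * to be resynchronized from a new base backup.
 */
static uint32
oldest_segment_to_keep(uint32 current_seg,
                       uint32 oldest_needed_seg,
                       uint32 max_kept_segments)
{
    uint32      keep_from = oldest_needed_seg;

    if (max_kept_segments > 0 &&
        current_seg - keep_from > max_kept_segments)
        keep_from = current_seg - max_kept_segments;

    return keep_from;           /* segments older than this may be recycled */
}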
On 05/08/10 17:14, Fujii Masao wrote: > I'm thinking to make users register and unregister each standbys via SQL > functions like register_standby() and unregister_standby(): The register/unregister facility should be accessible from the streaming replication connection, so that you don't need to connect to any particular database in addition to the streaming connection. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 01/08/10 15:30, Greg Stark wrote: > On Sun, Aug 1, 2010 at 7:11 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> I don't think any of this quorum stuff makes much sense without explicitly >> registering standbys in the master. > > This doesn't have to be done manually. The streaming protocol could > include the standby sending its system id to the master. The master > could just keep a list of system ids with the last record they've been > sent and the last they've confirmed receipt, fsync, application, > whatever the protocol covers. If the same system reconnects it just > overwrites the existing data for that system id. Systemid doesn't work for that. Systemid is assigned at initdb time, so all the standbys have the same systemid as the master. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
I wonder if we can continue to rely on the pg_usleep() loop for sleeping in walsender. On those platforms where signals don't interrupt sleep, sending the signal is not going to promptly wake up walsender. That was fine before, but any delay is going to be poison to synchronous replication performance. Thoughts? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Fujii Masao wrote: > On Wed, Aug 4, 2010 at 10:38 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > Then you risk running out of disk space. Similar to having an archive > > command that fails for some reason. > > > > That's one reason the registration should not be too automatic - there is > > serious repercussions if the standby just disappears. If the standby is a > > synchronous one, the master will stop committing or delay acknowledging > > commits, depending on the configuration, and the master needs to keep extra > > WAL around. > > Umm... in addition to registration of each standby, I think we should allow > users to set the upper limit of the number of WAL files kept in pg_xlog to > avoid running out of disk space. If it exceeds the upper limit, the master > disconnects too old standbys from the cluster and removes all the WAL files > not required for current connected standbys. If you don't want any standby > to disappear unexpectedly because of the upper limit, you can set it to 0 > (= no limit). > > I'm thinking to make users register and unregister each standbys via SQL > functions like register_standby() and unregister_standby(): > > void register_standby(standby_name text, streaming_start_lsn text) > void unregister_standby(standby_name text) > > Note that standby_name should be specified in recovery.conf of each > standby. > > By using them we can easily specify which WAL files are unremovable because > of new standby when taking the base backup for it as follows: > > SELECT register_standby('foo', pg_start_backup()) I know there has been discussion about how to identify the standby servers --- how about using the connection application_name in recovery.conf: primary_conninfo = 'host=localhost port=5432 application_name=slave1' The good part is that once recovery.conf goes away because it isn't a standby anymore, the application_name is gone. An even more interesting approach would be to specify the replication mode in the application_name: primary_conninfo = 'host=localhost port=5432 application_name=replay' and imagine being able to view the status of standby servers from pg_stat_activity. (Right now standby servers do not appear in pg_stat_activity.) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 05/08/10 13:40, Fujii Masao wrote: > On Wed, Aug 4, 2010 at 12:35 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> There's some race conditions with the signaling. If another process finishes >> XLOG flush and sends the signal when a walsender has just finished one >> iteration of its main loop, walsender will reset xlogsend_requested and go >> to sleep. It should not sleep but send the pending WAL immediately. > > Yep. To avoid that race condition, xlogsend_requested should be reset to > false after sleep and before calling XLogSend(). I attached the updated > version of the patch. There's still a small race condition: if you receive the signal just before entering pg_usleep(), it will not be interrupted. Of course, on platforms where signals don't interrupt sleep, the problem is even bigger. Magnus reminded me that we can use select() instead of pg_usleep() on such platforms, but that's still vulnerable to the race condition. ppoll() or pselect() could be used, but I don't think they're fully portable. I think we'll have to resort to the self-pipe trick mentioned in the Linux select(3) man page: > On systems that lack pselect(), reliable (and > more portable) signal trapping can be achieved using the self-pipe > trick (where a signal handler writes a byte to a pipe whose other end > is monitored by select() in the main program.) Another idea is to use something different than Unix signals, like ProcSendSignal/ProcWaitForSignal which are implemented using semaphores. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
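For illustration, a small standalone toy (not walsender code) showing the self-pipe trick mentioned above: the signal handler writes a byte into a pipe, and the main loop waits on the pipe's read end with select(), so a signal arriving at any moment either interrupts the wait or leaves a byte that makes the next select() return immediately.

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

static int  selfpipe[2];            /* [0] = read end, [1] = write end */

static void
wakeup_handler(int signo)
{
    int         save_errno = errno;

    (void) signo;
    (void) write(selfpipe[1], "x", 1);  /* write() is async-signal-safe */
    errno = save_errno;
}

static void
drain_pipe(void)
{
    char        buf[16];

    while (read(selfpipe[0], buf, sizeof(buf)) > 0)
        ;                           /* consume all queued wakeup bytes */
}

int
main(void)
{
    fd_set      rfds;
    struct timeval tv;

    if (pipe(selfpipe) != 0)
        return 1;
    /* non-blocking, so the handler never blocks and drain_pipe() terminates */
    fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
    fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);
    signal(SIGUSR1, wakeup_handler);

    for (;;)
    {
        FD_ZERO(&rfds);
        FD_SET(selfpipe[0], &rfds);
        tv.tv_sec = 0;
        tv.tv_usec = 200 * 1000;    /* fallback poll interval */

        /* returns at once if a wakeup byte is already queued in the pipe */
        if (select(selfpipe[0] + 1, &rfds, NULL, NULL, &tv) < 0 &&
            errno != EINTR)
            return 1;
        drain_pipe();

        /* in walsender, this is where the pending WAL would be sent */
        printf("woke up\n");
    }
}

Sending SIGUSR1 to the process from another terminal wakes the loop promptly even when the signal lands in the window just before select(), which is exactly the gap the plain sleep loop cannot cover.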
On Wed, Sep 15, 2010 at 8:39 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > I rebased the patch against current HEAD because it conflicted with > recent commits about a latch. Can you please rebase this again? It no longer applies. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company