Thread: Issues with Quorum Commit
All,

There's been a lot of discussion on synch rep lately which involves quorum commit. I need to raise some major design issues with quorum commit which I don't think people have really considered, and which may be sufficient to prevent it from being included in 9.1.

A. Permanent Synchronization Failure
------------------------------------

Quorum commit, like other forms of more-than-one-standby synch rep, offers the possibility that one or more standbys could end up irretrievably desynchronized with the master.

1. Quorum is 3 servers (out of 5) with mode "apply"
2. Standbys 2 and 4 receive and apply transaction #20001.
3. Due to a network issue, no other standby applies #20001.
4. Accordingly, the master rolls back #20001 and cancels, either due to timeout or DBA cancel.
5. #2 and #4 are now hopelessly out of synch with the master.

B. Eventual Inconsistency
-------------------------

If we have quorum commit, it's possible for any individual standby to be indefinitely ahead of any standby which is not needed by the quorum. This means that:

-- There is no clear criterion for when a standby which is not needed for quorum should be considered no longer a synch standby, and
-- Applications cannot assume that synch rep promises some specific window of synchronicity, eliminating a lot of the value of quorum commit.

C. Performance
--------------

Doing quorum commit requires significant extra accounting on the master's part: it must keep track of how many standbys have committed each pending transaction (and remember there may be many at the same time).

Doing so could add significant response-time overhead to the simple case where there is only one standby, as well as memory usage, and will likely mean a lot of troubleshooting of the mechanism from us.

D. Adding/Replacing Quorum Members
----------------------------------

For quorum commit to be really valuable, we need to be able to add new quorum members and remove dead ones *without stopping the master*.
Per discussion about the startup issues with only one master, we have not worked out how to do this for synch rep standbys. It's reasonable to assume that this will be more complex for a quorum group than with a single synch standby.

Consider the case, for example, where due to a network outage we have dropped below quorum. What is the strategy for getting the system running again by adding standbys?

All of the problems above are resolvable. Some of the CAP databases have probably resolved them, as well as some older telecom databases. However, all of them will require significant work, and even more significant debugging, from the project.

I would like to see quorum commit, in part because I think it would help push PostgreSQL further into cloud frameworks. However, I'm worried that if we make quorum commit a requirement of synch rep, we will not have synch rep in 9.1.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
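The failure sequence in (A) can be sketched as a toy model. This is illustrative Python only; all names and structures are invented for the sketch and bear no relation to PostgreSQL internals:

```python
# Toy model of scenario (A): a 3-of-5 "apply" quorum where only two
# standbys ever apply transaction #20001. All names and structures
# here are invented for illustration; this is not PostgreSQL code.

QUORUM = 3

def quorum_reached(applied_by):
    """The committing client is released only once at least QUORUM
    standbys have acknowledged applying the commit record."""
    return len(applied_by) >= QUORUM

# Standbys 2 and 4 apply #20001; a network issue stops the rest.
applied = {"s2", "s4"}
print(quorum_reached(applied))  # False -> the commit cannot complete

# If the master then gives up on #20001, s2 and s4 have applied a
# record the rest of the cluster never saw: they are permanently
# ahead unless rebuilt from a fresh base backup.
diverged = set() if quorum_reached(applied) else applied
print(sorted(diverged))  # ['s2', 's4']
```

The point of the sketch is the last step: whatever the master decides to do about the stalled commit, the two standbys that already applied it have diverged.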
On 05.10.2010 22:11, Josh Berkus wrote:

> There's been a lot of discussion on synch rep lately which involves quorum commit. I need to raise some major design issues with quorum commit which I don't think that people have really considered, and may be sufficient to prevent it from being included in 9.1.

Thanks for bringing these up.

> A. Permanent Synchronization Failure
> ---------------------------------
> Quorum commit, like other forms of more-than-one-standby synch rep, offers the possibility that one or more standbys could end up irretrievably desynchronized with the master.
>
> 1. Quorum is 3 servers (out of 5) with mode "apply"
> 2. Standbys 2 and 4 receive and apply transaction #20001.
> 3. Due to a network issue, no other standby applies #20001.
> 4. Accordingly, the master rolls back #20001 and cancels, either due to timeout or DBA cancel.

The master cannot roll back or cancel the transaction. That's completely infeasible: the WAL record has been written to local disk already. The best it can do is halt and wait for enough standbys to appear to fulfill the quorum. The client will hang waiting for the COMMIT to finish, and the transaction will appear as in-progress to other transactions.

There's a subtle point here that I don't think has been discussed yet: if the master is forcibly restarted at that point, with pg_ctl restart -m immediate, strictly speaking the master should start up in the same state, with the unlucky transaction still appearing as in-progress, until the standby acknowledges.

> 5. #2 and #4 are now hopelessly out of synch with the master.

> B. Eventual Inconsistency
> -------------------------
> If we have a quorum commit, it's possible for any individual standby to be indefinitely ahead of any standby which is not needed by the quorum.
> This means that:
>
> -- There is no clear criterion for when a standby which is not needed for quorum should be considered no longer a synch standby, and
> -- Applications cannot make assumptions that synch rep promises some specific window of synchronicity, eliminating a lot of the value of quorum commit.

Yep.

> C. Performance
> --------------
> Doing quorum commit requires significant extra accounting on the master's part: it must keep track of how many standbys committed for each pending transaction (and remember there may be many at the same time).
>
> Doing so could involve significant response-time overhead added to the simple case where there is only one standby, as well as memory usage, and likely a lot of troubleshooting of the mechanism from us.

My gut feeling is that this overhead will pale to insignificance compared to the network and other overheads of actually getting the WAL to the standby and processing the acknowledgments.

> D. Adding/Replacing Quorum Members
> ----------------------------------
> For Quorum commit to be really valuable, we need to be able to add new quorum members and remove dead ones *without stopping the master*. Per discussion about the startup issues with only one master, we have not worked out how to do this for synch rep standbys. It's reasonable to assume that this will be more complex for a quorum group than with a single synch standby.
>
> Consider the case, for example, where due to a network outage we have dropped below quorum. What is the strategy for getting the system running again by adding standbys?

You start a new one from the latest base backup and let it catch up? Possibly modifying the config file in the master to let it know about the new standby, if we go down that path. This part doesn't seem particularly hard to me.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
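Heikki's point above — that the commit record is durable on the master before any standby is consulted — can be sketched as a toy model. This is illustrative Python with invented names, not the actual PostgreSQL commit path:

```python
# Sketch of why the master cannot "roll back" a sync-rep commit: the
# commit record reaches the master's local WAL *before* it waits for
# any standby acknowledgment. Hypothetical structure, not the actual
# PostgreSQL commit path.

wal = []  # stand-in for the master's local WAL

def commit(xid, acks_needed, acks_received):
    wal.append(("COMMIT", xid))       # 1. durable on the master first
    if acks_received >= acks_needed:  # 2. only then wait for the quorum
        return "committed"
    # Cancelling here can only release the waiting client; the commit
    # record is already on disk and cannot be undone.
    return "waiting"

state = commit(20001, acks_needed=3, acks_received=2)
print(state)                      # waiting
print(("COMMIT", 20001) in wal)   # True: the record persists regardless
```

Because step 1 precedes step 2, the only degrees of freedom after a quorum failure are how long to wait and when to release the client — never whether the commit happened.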
Heikki,

> The master cannot roll back or cancel the transaction. That's completely infeasible, the WAL record has been written to local disk already. The best it can do is halt and wait for enough standbys to appear to fulfill the quorum. The client will hang waiting for the COMMIT to finish, and the transaction will appear as in-progress to other transactions.

Ohhh. Good point. So there's no real point in a timeout setting for quorum commit; it's always "wait forever". So this is a critical issue with "wait forever" even with one server.

> There's a subtle point here that I don't think has been discussed yet: if the master is forcibly restarted at that point, with pg_ctl restart -m immediate, strictly speaking the master should start up in the same state, with the unlucky transaction still appearing as in-progress, until the standby acknowledges.

Yeah. That makes it critical to be able to issue a command which says "drop all synch rep and commit whatever's pending". However, this makes for, in some ways, a worse situation: if you fail to achieve quorum on any commit, then you need to rebuild your entire quorum pool from scratch.

> You start a new one from the latest base backup and let it catch up? Possibly modifying the config file in the master to let it know about the new standby, if we go down that path. This part doesn't seem particularly hard to me.

Yeah? How do you modify the config file and get the master to consider the new server to be part of the quorum pool *without restarting the master*?

Again, I'm just saying that merely doing single-server synch rep, *and* making HS/SR easier to admin in general, is going to be a big task for 9.1. Quorum commit needs to be considered a separate feature, and one which is dispensable for 9.1.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Tue, 2010-10-05 at 12:11 -0700, Josh Berkus wrote:

> B. Eventual Inconsistency
> -------------------------
> If we have a quorum commit, it's possible for any individual standby to be indefinitely ahead of any standby which is not needed by the quorum. This means that:
>
> -- There is no clear criterion for when a standby which is not needed for quorum should be considered no longer a synch standby, and
> -- Applications cannot make assumptions that synch rep promises some specific window of synchronicity, eliminating a lot of the value of quorum commit.

Point B seems particularly dangerous.

When you lose one of the systems and the lagging server becomes required for quorum, then all of a sudden you could be facing a huge delay to commit the next transaction (because it needs to catch up on a lot of WAL replay). This can happen even without a network problem at all, and seems very likely to result in the lagging system being considered "down" due to a timeout. Not good, because the reason it is required for quorum is that another standby just went down.

In other words, a lagging standby combined with a timeout mechanism is essentially useless, because it will never catch up in time to be a part of the quorum.

Regards,
	Jeff Davis
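The lag buildup Jeff describes can be made concrete with a toy simulation. The numbers below are invented purely for illustration; the shape of the problem, not the magnitudes, is the point:

```python
# Toy illustration of the lagging-standby problem: with "1 of 2"
# first-ack sync rep, the slower standby accumulates unapplied WAL.
# The numbers are invented purely for illustration.

fast_applied = 0
slow_applied = 0
for txn in range(10_000):       # master commits 10,000 transactions
    fast_applied = txn + 1      # the fast standby acks every commit
    if txn % 1_000 == 0:
        slow_applied = txn + 1  # the slow standby applies far less often

backlog = fast_applied - slow_applied
print(backlog)  # 999 records of unapplied WAL on the slow standby
# If the fast standby now fails, the next commit must wait for this
# entire backlog to replay -- very likely longer than any timeout.
```

Nothing in the quorum rule bounds `backlog` while the fast standby is healthy, which is exactly why the slow standby is useless as a fallback at the moment it is needed.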
On Tue, 2010-10-05 at 22:32 +0300, Heikki Linnakangas wrote:

> On 05.10.2010 22:11, Josh Berkus wrote:
> > There's been a lot of discussion on synch rep lately which involves quorum commit. I need to raise some major design issues with quorum commit which I don't think that people have really considered, and may be sufficient to prevent it from being included in 9.1.
>
> Thanks for bringing these up.

Yes, I'm very happy to discuss these.

The points appear to be directed at "quorum commit", which is a name I've used. But most of the points apply more to Fujii's patch than my own. I can only presume that Josh wants to prevent us from adopting a design that allows sync against multiple standbys.

> > A. Permanent Synchronization Failure
> > ---------------------------------
> > Quorum commit, like other forms of more-than-one-standby synch rep, offers the possibility that one or more standbys could end up irretrievably desynchronized with the master.
> >
> > 1. Quorum is 3 servers (out of 5) with mode "apply"
> > 2. Standbys 2 and 4 receive and apply transaction #20001.
> > 3. Due to a network issue, no other standby applies #20001.
> > 4. Accordingly, the master rolls back #20001 and cancels, either due to timeout or DBA cancel.
>
> The master cannot roll back or cancel the transaction. That's completely infeasible, the WAL record has been written to local disk already. The best it can do is halt and wait for enough standbys to appear to fulfill the quorum. The client will hang waiting for the COMMIT to finish, and the transaction will appear as in-progress to other transactions.

Yes, that point has long been understood. Neither patch does this, and in fact the issue is a completely general one.
> There's a subtle point here that I don't think has been discussed yet: if the master is forcibly restarted at that point, with pg_ctl restart -m immediate, strictly speaking the master should start up in the same state, with the unlucky transaction still appearing as in-progress, until the standby acknowledges.

That is a very important point, but again, nothing to do with quorum commit. For strict correctness, we should do that. Are you suggesting we should do that here?

> > 5. #2 and #4 are now hopelessly out of synch with the master.
>
> > B. Eventual Inconsistency
> > -------------------------
> > If we have a quorum commit, it's possible for any individual standby to be indefinitely ahead of any standby which is not needed by the quorum. This means that:
> >
> > -- There is no clear criterion for when a standby which is not needed for quorum should be considered no longer a synch standby, and
> > -- Applications cannot make assumptions that synch rep promises some specific window of synchronicity, eliminating a lot of the value of quorum commit.
>
> Yep.

Could the person that wrote that actually explain what a "specific window of synchronicity" is? I'm not sure whether to agree or disagree.

> > C. Performance
> > --------------
> > Doing quorum commit requires significant extra accounting on the master's part: it must keep track of how many standbys committed for each pending transaction (and remember there may be many at the same time).
> >
> > Doing so could involve significant response-time overhead added to the simple case where there is only one standby, as well as memory usage, and likely a lot of troubleshooting of the mechanism from us.
>
> My gut feeling is that overhead will pale to insignificance compared to the network and other overheads of actually getting the WAL to the standby and processing the acknowledgments.

You're ignoring Josh's points.
Those exact points have been made by me in support of the design of my patch and against Fujii's. The mechanism to do this will be more complex and more likely to break. And it will be slower, and that is a concern for me.

> > D. Adding/Replacing Quorum Members
> > ----------------------------------
> > For Quorum commit to be really valuable, we need to be able to add new quorum members and remove dead ones *without stopping the master*. Per discussion about the startup issues with only one master, we have not worked out how to do this for synch rep standbys. It's reasonable to assume that this will be more complex for a quorum group than with a single synch standby.
> >
> > Consider the case, for example, where due to a network outage we have dropped below quorum. What is the strategy for getting the system running again by adding standbys?
>
> You start a new one from the latest base backup and let it catch up? Possibly modifying the config file in the master to let it know about the new standby, if we go down that path. This part doesn't seem particularly hard to me.

Agreed, not sure of the issue there.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services
On Tue, 2010-10-05 at 13:45 -0700, Jeff Davis wrote:

> On Tue, 2010-10-05 at 12:11 -0700, Josh Berkus wrote:
> > B. Eventual Inconsistency
> > -------------------------
> > If we have a quorum commit, it's possible for any individual standby to be indefinitely ahead of any standby which is not needed by the quorum. This means that:
> >
> > -- There is no clear criterion for when a standby which is not needed for quorum should be considered no longer a synch standby, and
> > -- Applications cannot make assumptions that synch rep promises some specific window of synchronicity, eliminating a lot of the value of quorum commit.
>
> Point B seems particularly dangerous.
>
> When you lose one of the systems and the lagging server becomes required for quorum, then all of a sudden you could be facing a huge delay to commit the next transaction (because it needs to catch up on a lot of WAL replay). This can happen even without a network problem at all, and seems very likely to result in the lagging system being considered "down" due to a timeout. Not good, because the reason it is required for quorum is that another standby just went down.
>
> In other words, a lagging standby combined with a timeout mechanism is essentially useless, because it will never catch up in time to be a part of the quorum.

Thanks for explaining what was meant.

This issue is a serious problem with the apply-to-*all*-servers mode that Heikki has been describing as a useful use case. We register a standby, it goes down, and we decide to wait for it. Then when it does come back up, it takes ages to catch up.

This is really the nail in the coffin for the "all" servers use case, and a significant blow to the requirement for standby registration.

If we use N+1 redundancy as I have explained, then this situation does not occur until you have less than N standbys available. But then it's no surprise that RAID-5 won't work with 4 drives either.
--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Oct 5, 2010 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> The points appear to be directed at "quorum commit", which is a name I've used. But most of the points apply more to Fujii's patch than my own. I can only presume that Josh wants to prevent us from adopting a design that allows sync against multiple standbys.

This looks to me like a cheap shot that doesn't advance the discussion. You are the first to complain when people don't take your ideas as seriously as you feel they should.

>> > A. Permanent Synchronization Failure
>> > ---------------------------------
>> > Quorum commit, like other forms of more-than-one-standby synch rep, offers the possibility that one or more standbys could end up irretrievably desynchronized with the master.
>> >
>> > 1. Quorum is 3 servers (out of 5) with mode "apply"
>> > 2. Standbys 2 and 4 receive and apply transaction #20001.
>> > 3. Due to a network issue, no other standby applies #20001.
>> > 4. Accordingly, the master rolls back #20001 and cancels, either due to timeout or DBA cancel.
>>
>> The master cannot roll back or cancel the transaction. That's completely infeasible, the WAL record has been written to local disk already. The best it can do is halt and wait for enough standbys to appear to fulfill the quorum. The client will hang waiting for the COMMIT to finish, and the transaction will appear as in-progress to other transactions.
>
> Yes, that point has long been understood. Neither patch does this, and in fact the issue is a completely general one.

Yep.

>> There's a subtle point here that I don't think has been discussed yet: if the master is forcibly restarted at that point, with pg_ctl restart -m immediate, strictly speaking the master should start up in the same state, with the unlucky transaction still appearing as in-progress, until the standby acknowledges.
>
> That is a very important point, but again, nothing to do with quorum commit.
> For strict correctness, we should do that. Are you suggesting we should do that here?

I agree that this has nothing to do with quorum commit. It does have to do with synchronous replication, but I'm skeptical that we want to get into it for this release, if ever.

>> > 5. #2 and #4 are now hopelessly out of synch with the master.
>>
>> > B. Eventual Inconsistency
>> > -------------------------
>> > If we have a quorum commit, it's possible for any individual standby to be indefinitely ahead of any standby which is not needed by the quorum. This means that:
>> >
>> > -- There is no clear criterion for when a standby which is not needed for quorum should be considered no longer a synch standby, and
>> > -- Applications cannot make assumptions that synch rep promises some specific window of synchronicity, eliminating a lot of the value of quorum commit.
>>
>> Yep.
>
> Could the person that wrote that actually explain what a "specific window of synchronicity" is? I'm not sure whether to agree, or disagree.

Me either.

>> > C. Performance
>> > --------------
>> > Doing quorum commit requires significant extra accounting on the master's part: it must keep track of how many standbys committed for each pending transaction (and remember there may be many at the same time).
>> >
>> > Doing so could involve significant response-time overhead added to the simple case where there is only one standby, as well as memory usage, and likely a lot of troubleshooting of the mechanism from us.
>>
>> My gut feeling is that overhead will pale to insignificance compared to the network and other overheads of actually getting the WAL to the standby and processing the acknowledgments.
>
> You're ignoring Josh's points. Those exact points have been made by me in support of the design of my patch and against Fujii's. The mechanism to do this will be more complex and more likely to break.
> And it will be slower and that is a concern for me.

I don't think Heikki ignored Josh's points, and I do think Heikki's analysis is correct.

>> > D. Adding/Replacing Quorum Members
>> > ----------------------------------
>> > For Quorum commit to be really valuable, we need to be able to add new quorum members and remove dead ones *without stopping the master*. Per discussion about the startup issues with only one master, we have not worked out how to do this for synch rep standbys. It's reasonable to assume that this will be more complex for a quorum group than with a single synch standby.
>> >
>> > Consider the case, for example, where due to a network outage we have dropped below quorum. What is the strategy for getting the system running again by adding standbys?
>>
>> You start a new one from the latest base backup and let it catch up? Possibly modifying the config file in the master to let it know about the new standby, if we go down that path. This part doesn't seem particularly hard to me.
>
> Agreed, not sure of the issue there.

Also agreed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Tue, 2010-10-05 at 13:43 -0700, Josh Berkus wrote:

> Again, I'm just saying that merely doing single-server synch rep, *and* making HS/SR easier to admin in general, is going to be a big task for 9.1. Quorum commit needs to be considered a separate feature, and one which is dispensable for 9.1.

Agreed. So no need at all for standby.conf. Phew!

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services
On Tue, 2010-10-05 at 17:21 -0400, Robert Haas wrote:

> On Tue, Oct 5, 2010 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > The points appear to be directed at "quorum commit", which is a name I've used. But most of the points apply more to Fujii's patch than my own. I can only presume that Josh wants to prevent us from adopting a design that allows sync against multiple standbys.
>
> This looks to me like a cheap shot that doesn't advance the discussion. You are the first to complain when people don't take your ideas as seriously as you feel they should.

Whatever are you talking about? This is a technical discussion. I'm checking what Josh actually means by quorum commit, since regrettably the points fall very badly against Fujii's patch. Josh has echoed some points of mine, and Jeff's point about dangerous behaviour blows a hole a mile wide in the justification for standby.conf etc.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services
Simon, Robert,

> The points appear to be directed at "quorum commit", which is a name I've used. But most of the points apply more to Fujii's patch than my own.

Per previous discussion, I'm trying to get at what reasonable behavior is, rather than targeting one patch or the other.

> I can only presume that Josh wants to prevent us from adopting a design that allows sync against multiple standbys.

Quorum commit == "X servers need to ack for commit", where X > 1. It's usually done as "X out of Y servers must ack", but it's not a given that the master needs to know how many servers there are, just how many ack'ed.

And I'm not against it; I'm just pointing out that it gives us some issues which we don't have with a single standby, and thus quorum commit ought to be treated as a separate feature in 9.1 development.

>> The master cannot roll back or cancel the transaction. That's completely infeasible, the WAL record has been written to local disk already. The best it can do is halt and wait for enough standbys to appear to fulfill the quorum. The client will hang waiting for the COMMIT to finish, and the transaction will appear as in-progress to other transactions.
>
> Yes, that point has long been understood. Neither patch does this, and in fact the issue is a completely general one.

So, in that case, if it's been 10 minutes and we're still not getting an ack from the standbys, what's the exit strategy for the hapless DBA? Practically speaking? Without restarting the master?

Last I checked, our goal with synch standby was to increase availability, not decrease it. This is, however, not an issue with quorum commit, but an issue with sync rep in general.

> Could the person that wrote that actually explain what a "specific window of synchronicity" is? I'm not sure whether to agree, or disagree.

A specific amount of time within which all nodes will be consistent regarding that specific transaction.
>> You start a new one from the latest base backup and let it catch up? Possibly modifying the config file in the master to let it know about the new standby, if we go down that path. This part doesn't seem particularly hard to me.
>
> Agreed, not sure of the issue there.

See previous post. The critical phrase is *without restarting the master*. AFAICT, no patch has addressed the need to change the master's synch configuration without restarting it. It's possible that I'm not following something, in which case I'd love to have it pointed out.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
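Josh's definition of quorum commit earlier in this message can be made concrete with a small sketch. This is illustrative Python with an invented helper, drawn from neither patch: the committing client is released once X distinct standbys have acknowledged, and the master never needs to know the pool size Y.

```python
# Sketch of "X servers need to ack for commit": the commit waiter is
# released once X *distinct* standbys have acknowledged. Note the
# master never consults the total number of standbys Y.
# Hypothetical helper, not from either patch.

def quorum_satisfied(acks, x):
    """acks: names of the standbys that acknowledged this commit."""
    return len(set(acks)) >= x

print(quorum_satisfied(["s1", "s3"], x=2))        # True: two distinct acks
print(quorum_satisfied(["s1", "s1", "s1"], x=2))  # False: one standby, repeated
```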
On Tue, 2010-10-05 at 15:14 -0700, Josh Berkus wrote:

> > I can only presume that Josh wants to prevent us from adopting a design that allows sync against multiple standbys.
>
> Quorum commit == "X servers need to ack for commit", where X > 1. Usually done as "X out of Y servers must ack", but it's not a given that the master needs to know how many servers there are, just how many ack'ed.
>
> And I'm not against it; I'm just pointing out that it gives us some issues which we don't have with a single standby, and thus quorum commit ought to be treated as a separate feature in 9.1 development.

OK, so I did understand you correctly.

Heikki had argued that a use case existed where Y out of Y (i.e. all) nodes must acknowledge before we commit. That was the use case that required us to have standby registration. It was optional in all other cases.

We should note that Oracle only allows X=1, i.e. the first acknowledgement releases the waiter. My patch provides X=1 only, and takes advantage of the simpler in-memory data structures as a result.

> >> The master cannot roll back or cancel the transaction. That's completely infeasible, the WAL record has been written to local disk already. The best it can do is halt and wait for enough standbys to appear to fulfill the quorum. The client will hang waiting for the COMMIT to finish, and the transaction will appear as in-progress to other transactions.
> >
> > Yes, that point has long been understood. Neither patch does this, and in fact the issue is a completely general one.
>
> So, in that case, if it's been 10 minutes, and we're still not getting ack from standbys, what's the exit strategy for the hapless DBA? Practically speaking? Without restarting the master?
>
> Last I checked, our goal with synch standby was to increase availability, not decrease it. This is, however, not an issue with quorum commit, but an issue with sync rep in general.

Completely agree.
When we had that discussion some months/weeks back, we spoke about having a timeout. My patch has implemented a timeout, followed by a COMMIT. That allows increased availability, as you say.

You would also be able to specifically release all/some transactions from their wait state with a simple function, pg_cancel_sync_wait() (or similar name).

> > Could the person that wrote that actually explain what a "specific window of synchronicity" is? I'm not sure whether to agree, or disagree.
>
> A specific amount of time within which all nodes will be consistent regarding that specific transaction.

Certainly no patch offers that. I'm not sure such a possibility exists. Asking for higher X does make that situation worse.

> >> You start a new one from the latest base backup and let it catch up? Possibly modifying the config file in the master to let it know about the new standby, if we go down that path. This part doesn't seem particularly hard to me.
> >
> > Agreed, not sure of the issue there.
>
> See previous post. The critical phrase is *without restarting the master*. AFAICT, no patch has addressed the need to change the master's synch configuration without restarting it. It's possible that I'm not following something, in which case I'd love to have it pointed out.

My patch does not require a restart of the master to add/remove sync rep nodes. They just come and go as needed.

I don't think Fujii's patch would have a great problem with that either, but I can't speak for that with precision.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services
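The "timeout, then COMMIT anyway" behaviour Simon describes can be sketched as follows. This is a minimal illustrative model in Python (the class and method names are invented, and Python's threading primitives stand in for PostgreSQL's wait/wakeup machinery), not the patch's actual mechanism:

```python
# A sketch of "timeout, then COMMIT anyway": the backend waits for
# standby acks and proceeds on timeout, degrading that one commit to
# async behaviour. Names are hypothetical; threading.Condition stands
# in for PostgreSQL's internal wait/wakeup machinery.

import threading

class SyncRepWaiter:
    def __init__(self, acks_needed):
        self.cond = threading.Condition()
        self.acks = 0
        self.needed = acks_needed

    def ack(self):  # called when a standby acknowledges
        with self.cond:
            self.acks += 1
            self.cond.notify_all()

    def wait(self, timeout):
        """True if the quorum acked in time; False on timeout
        (the transaction commits locally either way)."""
        with self.cond:
            return self.cond.wait_for(lambda: self.acks >= self.needed,
                                      timeout=timeout)

w = SyncRepWaiter(acks_needed=1)
threading.Timer(0.05, w.ack).start()  # a standby replies after 50 ms
print(w.wait(timeout=1.0))            # True: ack arrived before the timeout

w2 = SyncRepWaiter(acks_needed=1)
print(w2.wait(timeout=0.1))           # False: timed out; commit proceeds
```

A pg_cancel_sync_wait()-style function would amount to an external call that forces the predicate true (or otherwise wakes the waiter) for some or all pending commits.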
> Heikki had argued that a use case existed where Y out of Y (i.e. all) nodes must acknowledge before we commit. That was the use case that required us to have standby registration. It was optional in all other cases.

Yeah, Y of Y is just a special case of X of Y. And, IMHO, it's rather pointless if we can't guarantee consistency between the standbys, which we can't.

> We should note that Oracle only allows X=1, i.e. first acknowledgement releases waiter. My patch provides X=1 only and takes advantage of the simpler in-memory data structures as a result.

I agree that we ought to start with X=1 for 9.1 and leave more complicated architectures until we have that committed and tested.

> You would also be able to specifically release all/some transactions from wait state with a simple function pg_cancel_sync_wait() (or similar name).

That would be fine for the use cases I'll be implementing.

> My patch does not require a restart of the master to add/remove sync rep nodes. They just come and go as needed.
>
> I don't think Fujii's patch would have a great problem with that either, but I can't speak for that with precision.

OK. That really was not made clear in prior arguments.

FYI, for the production uses of synch rep I'd specifically be implementing, what the users would want is:

1) One master, one synch standby, and 1-2 asynch standbys.
2) Synch rep tries to synch for # seconds.
3) If it fails, it switches the synch standby to asynch and screams bloody murder somewhere Nagios can pick it up.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
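The fallback policy in steps 1)-3) above can be sketched in a few lines. This is illustrative Python; the function, its parameters, and the alert callback are all hypothetical stand-ins, not part of any proposed patch:

```python
# The fallback policy sketched above: try for a synchronous ack for a
# bounded time, then degrade that standby to async and raise a
# monitoring alert. All names here are hypothetical.

def commit_with_fallback(wait_for_ack, timeout_s, alert):
    """wait_for_ack(timeout_s) -> True if the sync standby acked."""
    if wait_for_ack(timeout_s):
        return "sync"
    alert("sync standby timed out; degraded to async replication")
    return "async"

alerts = []
mode = commit_with_fallback(lambda t: False, 5.0, alerts.append)
print(mode)        # async
print(alerts[0])   # the message a monitor like Nagios would pick up
```

The design choice here is that a timeout changes the replication mode and raises an alarm rather than blocking commits indefinitely, trading a durability guarantee for availability.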
On Tue, 2010-10-05 at 22:19 +0100, Simon Riggs wrote:

> > In other words, a lagging standby combined with a timeout mechanism is essentially useless, because it will never catch up in time to be a part of the quorum.
>
> Thanks for explaining what was meant.
>
> This issue is a serious problem with the apply to *all* servers that Heikki has been describing as being a useful use case. We register a standby, it goes down and we decide to wait for it. Then when it does come back up it takes ages to catch up.
>
> This is really the nail in the coffin for the "All" servers use case, and a significant blow to the requirement for standby registration.

I'm not sure I entirely understand. I was concerned about the case of a standby server being allowed to lag behind the rest by a large number of WAL records. That can't happen in the "wait for all servers to apply" case, because the system would become unavailable rather than allow a significant difference in the amount of WAL applied.

I'm not saying that an unavailable system is good, but I don't see how my particular complaint applies to the "wait for all servers to apply" case.

The case I was worried about is:
* 1 master and 2 standbys
* The rule is "wait for at least one standby to apply the WAL"

In your notation, I believe that's M -> { S1, S2 }.

In that case, if S1 is just a little faster than S2, then S2 might build up a significant queue of unapplied WAL. Then, when S1 goes down, there's no way for the slower one to acknowledge a new transaction without playing through all of the unapplied WAL.

Intuitively, the administrator would think that he was getting both HA and redundancy, but in reality the availability is no better than if there were only two servers (M -> S1), except that it might be faster to replay the WAL than to set up a new standby (but that's not guaranteed).

I think you would call that a misconfiguration, and I would agree.
I was just trying to point out a pitfall that I didn't see until I read Josh's email.

> If we use N+1 redundancy as I have explained, then this situation does
> not occur until you have less than N standbys available. But then it's
> no surprise that RAID-5 won't work with 4 drives either.

Now I'm more confused. I assume that was a typo (because a RAID-5 does work with 4 drives), but I think it obscured your point.

Regards,
Jeff Davis
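Jeff's pitfall can be illustrated with a toy model: under "wait for at least one standby to apply", the master advances at the speed of the fastest standby, so a slightly slower one accumulates unapplied WAL without bound. The following sketch is purely illustrative (no real replication, invented function name); the rates model S2 applying 9 records for every 10 that S1 applies.

```python
# Toy model of Jeff's M -> {S1, S2} scenario with quorum 1: the master
# commits as soon as the faster standby (S1) applies, so the slower
# standby (S2) builds up a backlog of unapplied WAL records.
def simulate(records, s1_rate=10, s2_rate=9):
    s1_applied = s2_applied = 0
    for _ in range(records):
        # master commits as soon as S1 applies this record
        s1_applied += 1
        # S2 applies s2_rate records for every s1_rate that S1 applies
        if s1_applied * s2_rate // s1_rate > s2_applied:
            s2_applied += 1
    return s1_applied - s2_applied  # S2's unapplied backlog

# After 10,000 commits S2 lags by 1,000 records in this model; if S1
# now fails, S2 cannot acknowledge anything until that backlog replays.
print(simulate(10_000))   # prints 1000
```

The backlog grows linearly with traffic, which is why the availability after S1's failure is no better than M -> S1 until S2 finishes replaying.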
On Tue, 2010-10-05 at 18:52 -0700, Jeff Davis wrote:
> I'm not saying that an unavailable system is good, but I don't see how
> my particular complaint applies to the "wait for all servers to apply"
> case.
> The case I was worried about is:
> * 1 master and 2 standby
> * The rule is "wait for at least one standby to apply the WAL"
>
> In your notation, I believe that's M -> { S1, S2 }
>
> In that case, if one S1 is just a little faster than S2, then S2 might
> build up a significant queue of unapplied WAL. Then, when S1 goes down,
> there's no way for the slower one to acknowledge a new transaction
> without playing through all of the unapplied WAL.

That situation would require two things:
* First, you have set up async replication and you're not monitoring it properly. Shame on you.
* Second, you would have to request "apply" mode sync rep. If you had requested "recv" or "fsync" mode, then the standby does *not* have to have applied the WAL before acknowledgement.

Since the first problem is a generic problem with async replication, and can already happen in 8.2+, it's not exactly an argument against a new feature.

> Intuitively, the administrator would think that he was getting both HA
> and redundancy, but in reality the availability is no better than if
> there were only two servers (M -> S1), except that it might be faster to
> replay the WAL then to set up a new standby (but that's not guaranteed).

Not guaranteed, but very likely that the standby would not be that far behind. If it gets too far behind it will likely blow out the disk space on the standby and fail.

> I think you would call that a misconfiguration, and I would agree.

Yes, regrettably there are various ways to misconfigure this. The above is really a degeneration of the 2 standby case into the 1 standby case: if you ask for 2 standbys and one of them is ineffective, then the system acts like you have only one.
> I was
> just trying to point out a pitfall that I didn't see until I read Josh's
> email.

You mention that it cannot occur if we choose to lock up the master and cause transactions to wait. That may be true in many cases. It does still occur when we have transactions that generate a large amount of WAL, loads, ALTER TABLEs etc.. In those cases, S2 could well fall far behind S1 during those long transactions and if S1 goes down at that point there would be a backlog to apply. But again, this only applies to "apply" mode sync rep.

So it can occur in both cases, though it now looks to me that it's less important an issue in either case. So I think this doesn't rate the term dangerous to describe it any longer.

Thanks for your careful thought and analysis on this.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 06.10.2010 01:14, Josh Berkus wrote: > Last I checked, our goal with synch standby was to increase availablity, > not decrease it. No. Synchronous replication does not help with availability. It allows you to achieve zero data loss, ie. if the master dies, you are guaranteed that any transaction that was acknowledged as committed, is still committed. The other use case is keeping a hot standby server (or servers) up-to-date, so that you can run queries against it and you are guaranteed to get the same results you would if you ran the query in the master. Those are the two reasonable use cases I've seen. Anything else that has been discussed is some sort of a combination of those two, or something that doesn't make much sense when you scratch the surface and start looking at the failure modes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 06.10.2010 01:14, Josh Berkus wrote:
>>> You start a new one from the latest base backup and let it catch up?
>>> Possibly modifying the config file in the master to let it know about
>>> the new standby, if we go down that path. This part doesn't seem
>>> particularly hard to me.
>>
>> Agreed, not sure of the issue there.
>
> See previous post. The critical phrase is *without restarting the
> master*. AFAICT, no patch has addressed the need to change the master's
> synch configuration without restarting it. It's possible that I'm not
> following something, in which case I'd love to have it pointed out.

Fair enough. I agree it's important that the configuration can be changed on the fly. It's orthogonal to the other things discussed, so let's just assume for now that we'll have that. If not in the first version, it can be added afterwards. "pg_ctl reload" is probably how it will be done.

There are some interesting behavioral questions there on what happens when the configuration is changed. Like if you first define that 3 out of 5 servers must acknowledge, and you have an in-progress commit that has received 2 acks already. If you then change the config to "2 out of 4" servers must acknowledge, is the in-progress commit now satisfied? From the admin point of view, the server that was removed from the system might've been one that had acknowledged already, and logically in the new configuration the transaction has only received 1 acknowledgment from those servers that are still part of the system.

Explicitly naming the standbys in the config file would solve that particular corner case, but it would no doubt introduce other similar ones. But it's an orthogonal issue, we'll figure it out when we get there.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
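Heikki's "3 of 5" to "2 of 4" reload example comes down to how the acks already collected for an in-progress commit are re-evaluated against the new configuration. A sketch of the two plausible interpretations (illustrative Python, invented names, not any actual patch):

```python
# Two ways to re-check an in-progress commit after a config reload,
# per the "3 of 5" -> "2 of 4" example. Names are invented.

def satisfied_by_count(acks, quorum):
    # Interpretation 1: count every ack received so far, even from a
    # standby the new configuration no longer knows about.
    return len(acks) >= quorum

def satisfied_by_membership(acks, quorum, members):
    # Interpretation 2: count only acks from standbys still in the
    # configuration; an ack from a removed standby no longer counts.
    return len(acks & members) >= quorum

acks = {"S1", "S5"}                     # 2 acks under the old "3 of 5" rule
new_members = {"S1", "S2", "S3", "S4"}  # S5 was removed; quorum is now 2

assert satisfied_by_count(acks, 2) is True                      # released
assert satisfied_by_membership(acks, 2, new_members) is False   # waits on
```

Interpretation 2 matches Heikki's "logically only 1 acknowledgment" reading, and it is the one that requires the config to name standbys explicitly so that membership is well defined.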
On 10/06/2010 04:31 AM, Simon Riggs wrote: > That situation would require two things > * First, you have set up async replication and you're not monitoring it > properly. Shame on you. The way I read it, Jeff is complaining about the timeout you propose that effectively turns sync into async replication in case of a failure. With a master that waits forever, the standby that's newly required for quorum certainly still needs its time to catch up. But it wouldn't live in danger of being "optimized away" for availability in case it cannot catch up within the given timeout. It's a tradeoff between availability and durability. > So it can occur in both cases, though it now looks to me that its less > important an issue in either case. So I think this doesn't rate the term > dangerous to describe it any longer. The proposed timeout certainly still sounds dangerous to me. I'd rather recommend setting it to an incredibly huge value to minimize its dangers and get sync replication when that is what has been asked for. Use async replication for increased availability. Or do you envision any use case that requires a quorum of X standbies for normal operation but is just fine with only none to (X-1) standbies in case of failures? IMO that's when sync replication is most needed and when it absolutely should hold to its promises - even if it means to stop the system. There's no point in continuing operation if you cannot guarantee the minimum requirements for durability. If you happen to want such a thing, you should better rethink your minimum requirement (as performance for normal operations might benefit from a lower minimum as well). Regards Markus Wanner
On Wed, Oct 6, 2010 at 10:52 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> I'm not sure I entirely understand. I was concerned about the case of a
> standby server being allowed to lag behind the rest by a large number of
> WAL records. That can't happen in the "wait for all servers to apply"
> case, because the system would become unavailable rather than allow a
> significant difference in the amount of WAL applied.
>
> I'm not saying that an unavailable system is good, but I don't see how
> my particular complaint applies to the "wait for all servers to apply"
> case.
>
> The case I was worried about is:
> * 1 master and 2 standby
> * The rule is "wait for at least one standby to apply the WAL"
>
> In your notation, I believe that's M -> { S1, S2 }
>
> In that case, if one S1 is just a little faster than S2, then S2 might
> build up a significant queue of unapplied WAL. Then, when S1 goes down,
> there's no way for the slower one to acknowledge a new transaction
> without playing through all of the unapplied WAL.
>
> Intuitively, the administrator would think that he was getting both HA
> and redundancy, but in reality the availability is no better than if
> there were only two servers (M -> S1), except that it might be faster to
> replay the WAL then to set up a new standby (but that's not guaranteed).

Agreed. This is similar to my previous complaint.
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00946.php

This problem would happen even if we fix the quorum to 1 as Josh proposes. To avoid this, the master must wait for ACK from all the connected synchronous standbys.

I think that this is especially likely to happen when we choose the 'apply' replication level, because that level can easily cause a synchronous standby to lag due to conflicts between recovery and read-only queries.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 10/06/2010 08:31 AM, Heikki Linnakangas wrote:
> On 06.10.2010 01:14, Josh Berkus wrote:
>> Last I checked, our goal with synch standby was to increase availablity,
>> not decrease it.
>
> No. Synchronous replication does not help with availability. It allows
> you to achieve zero data loss, ie. if the master dies, you are
> guaranteed that any transaction that was acknowledged as committed, is
> still committed.

Strictly speaking, it even reduces availability. Which is why nobody actually wants *only* synchronous replication. Instead they use quorum commit or semi-synchronous (shudder) replication, which only requires *some* nodes to be in sync, but effectively replicates asynchronously to the others.

From that point of view, the requirement of having one synch and two async standbies is pretty much the same as having three synch standbies with a quorum commit of 1. (Except for the additional availability of the latter variant, because in case of a failure of the one sync standby, any of the others can take over without admin intervention.)

Regards

Markus Wanner
On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > No. Synchronous replication does not help with availability. It allows you > to achieve zero data loss, ie. if the master dies, you are guaranteed that > any transaction that was acknowledged as committed, is still committed. Hmm.. but we can increase availability without any data loss by using synchronous replication. Many people have already been using synchronous replication softwares such as DRBD for that purpose. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 06.10.2010 11:09, Fujii Masao wrote: > On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> No. Synchronous replication does not help with availability. It allows you >> to achieve zero data loss, ie. if the master dies, you are guaranteed that >> any transaction that was acknowledged as committed, is still committed. > > Hmm.. but we can increase availability without any data loss by using > synchronous > replication. Many people have already been using synchronous > replication softwares > such as DRBD for that purpose. Sure, but it's not the synchronous aspect that increases availability. It's the replication aspect, and we already have that. Making the replication synchronous allows zero data loss in case the master suddenly dies, but it comes at the cost of availability. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 10/06/2010 10:17 AM, Heikki Linnakangas wrote:
> On 06.10.2010 11:09, Fujii Masao wrote:
>> Hmm.. but we can increase availability without any data loss by using
>> synchronous
>> replication. Many people have already been using synchronous
>> replication softwares
>> such as DRBD for that purpose.
>
> Sure, but it's not the synchronous aspect that increases availability.
> It's the replication aspect, and we already have that.

..the *asynchronous* replication aspect, yes.

The drbd.conf man page [1] describes the parameters of DRBD. It's worth noting that even in "Protocol C" (synchronous mode), they sport a timeout of only 6 seconds (by default). After that, the primary node proceeds without any kind of guarantee (which can be thought of as switching to async replication). Just as Simon proposes for Postgres as well.

Maybe that really is enough for now. Everybody that needs stricter durability guarantees needs to wait for Postgres-R ;-)

Regards

Markus Wanner

[1]: drbd.conf man page: http://www.drbd.org/users-guide/re-drbdconf.html
On Wed, Oct 6, 2010 at 5:17 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 06.10.2010 11:09, Fujii Masao wrote: >> >> On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> >>> No. Synchronous replication does not help with availability. It allows >>> you >>> to achieve zero data loss, ie. if the master dies, you are guaranteed >>> that >>> any transaction that was acknowledged as committed, is still committed. >> >> Hmm.. but we can increase availability without any data loss by using >> synchronous >> replication. Many people have already been using synchronous >> replication softwares >> such as DRBD for that purpose. > > Sure, but it's not the synchronous aspect that increases availability. It's > the replication aspect, and we already have that. Making the replication > synchronous allows zero data loss in case the master suddenly dies, but it > comes at the cost of availability. Yep. But I mean that the synchronous aspect is helpful to increase the availability of the system which requires no data loss. In asynchronous replication, when the master goes down, we have to salvage the missing WAL for the standby from the failed master to avoid data loss. This would take very long and decrease the availability of the system which doesn't accept any data loss. Since the synchronous doesn't require such a salvage, it can increase the availability of such a system. If we want only no data loss, we have only to implement the wait-forever option. But if we make consideration for the above-mentioned availability, the return-immediately option also would be required. In some (many, I think) cases, I think that we need to consider availability and no data loss together, and consider the balance of them. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 06.10.2010 11:39, Markus Wanner wrote:
> On 10/06/2010 10:17 AM, Heikki Linnakangas wrote:
>> On 06.10.2010 11:09, Fujii Masao wrote:
>>> Hmm.. but we can increase availability without any data loss by using
>>> synchronous
>>> replication. Many people have already been using synchronous
>>> replication softwares
>>> such as DRBD for that purpose.
>>
>> Sure, but it's not the synchronous aspect that increases availability.
>> It's the replication aspect, and we already have that.
>
> ..the *asynchronous* replication aspect, yes.
>
> The drbd.conf man page [1] describes parameters of DRBD. It's worth
> noting that even in "Protocol C" (synchronous mode), they sport a
> timeout of only 6 seconds (by default).

Wow, that is really short. Are you sure? I have no first hand experience with DRBD, and reading that man page, I get the impression that the timeout is just for deciding that the TCP connection is dead. There is also the ko-count parameter, which defaults to zero. I would guess that ko-count=0 is "wait forever", while ko-count=1 is what you described, but I'm not sure.

It's not hard to imagine the master failing in a way that first causes the connection to standby to drop, and the disk failing 6 seconds later. A fire that destroys the network cable first and then spreads to the disk array, for example.

> [1]: drbd.conf man page:
> http://www.drbd.org/users-guide/re-drbdconf.html

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 06.10.2010 11:49, Fujii Masao wrote: > On Wed, Oct 6, 2010 at 5:17 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Sure, but it's not the synchronous aspect that increases availability. It's >> the replication aspect, and we already have that. Making the replication >> synchronous allows zero data loss in case the master suddenly dies, but it >> comes at the cost of availability. > > Yep. But I mean that the synchronous aspect is helpful to increase the > availability of the system which requires no data loss. In asynchronous > replication, when the master goes down, we have to salvage the missing > WAL for the standby from the failed master to avoid data loss. This would > take very long and decrease the availability of the system which doesn't > accept any data loss. Since the synchronous doesn't require such a salvage, > it can increase the availability of such a system. In general, salvaging the WAL that was not sent to the standby yet is outright impossible. You can't achieve zero data loss with asynchronous replication at all. > If we want only no data loss, we have only to implement the wait-forever > option. But if we make consideration for the above-mentioned availability, > the return-immediately option also would be required. > > In some (many, I think) cases, I think that we need to consider availability > and no data loss together, and consider the balance of them. If you need both, you need three servers as Simon pointed out earlier. There is no way around that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 10/06/2010 10:53 AM, Heikki Linnakangas wrote:
> Wow, that is really short. Are you sure? I have no first hand experience
> with DRBD,

Neither do I.

> and reading that man page, I get the impression that the
> timeout is just for deciding that the TCP connection is dead. There is
> also the ko-count parameter, which defaults to zero. I would guess that
> ko-count=0 is "wait forever", while ko-count=1 is what you described,
> but I'm not sure.

Yeah, sounds more likely. Then I'm surprised that I didn't find any warning that Protocol C definitely reduces availability (with the ko-count=0 default, that is). Instead, they only state that it's the most used replication mode, which really makes me wonder. [1]

Sorry for adding confusion by not researching properly.

Regards

Markus Wanner

[1] DRBD Replication Modes http://www.drbd.org/users-guide-emb/s-replication-protocols.html
On Wed, Oct 6, 2010 at 10:17, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 06.10.2010 11:09, Fujii Masao wrote: >> >> On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> >>> No. Synchronous replication does not help with availability. It allows >>> you >>> to achieve zero data loss, ie. if the master dies, you are guaranteed >>> that >>> any transaction that was acknowledged as committed, is still committed. >> >> Hmm.. but we can increase availability without any data loss by using >> synchronous >> replication. Many people have already been using synchronous >> replication softwares >> such as DRBD for that purpose. > > Sure, but it's not the synchronous aspect that increases availability. It's > the replication aspect, and we already have that. Making the replication > synchronous allows zero data loss in case the master suddenly dies, but it > comes at the cost of availability. That's only for a narrow definition of availability. For a lot of people, having access to your data isn't considered availability unless you can trust the data... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On 06.10.2010 13:41, Magnus Hagander wrote: > That's only for a narrow definition of availability. For a lot of > people, having access to your data isn't considered availability > unless you can trust the data... Ok, fair enough. For that, synchronous replication in the "wait forever" mode is the only alternative. That on its own doesn't give you any boost in availability, on the contrary, but coupled with suitable clustering tools to handle failover and deciding when the standby is dead, you can achieve that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Markus Wanner <markus@bluegap.ch> writes:
> On 10/06/2010 04:31 AM, Simon Riggs wrote:
>> That situation would require two things
>> * First, you have set up async replication and you're not monitoring it
>> properly. Shame on you.
>
> The way I read it, Jeff is complaining about the timeout you propose
> that effectively turns sync into async replication in case of a failure.
>
> With a master that waits forever, the standby that's newly required for
> quorum certainly still needs its time to catch up. But it wouldn't live
> in danger of being "optimized away" for availability in case it cannot
> catch up within the given timeout. It's a tradeoff between availability
> and durability.

What is necessary here is a clear view on the possible states that a standby can be in at any time, and we must stop trying to apply to some non-ready standby the behavior we want when it's already in-sync. From my experience operating londiste, those states would be:

1. base-backup — self explaining
2. catch-up — getting the WAL to catch up after base backup
3. wanna-sync — don't yet have all the WAL to get in sync
4. do-sync — all WALs are there, coming soon
5. ok (async | recv | fsync | reply — feedback loop engaged)

So you only consider that a standby is a candidate for sync rep when it's reached the ok state, and that's when it's able to fill the feedback loop we've been talking about. Standby state != ok, no waiting no nothing, it's *not* a standby as far as the master is concerned.

The other states allow us to manage accepting a new standby into an existing setup, and to manage error failures. When we stop receiving the feedback loop events, the master knows the slave ain't in the "ok" state any more and can demote it to "wanna-sync", because it has to keep WALs until the slave comes again. If the standby is not back online and wal_keep_segments makes it so that we can't keep its WAL anymore, the state gets back to "base-backup".
Not going into every detail here (for example, we might need some protocol arbitrage for the standby to be able to explain to the master that it's ok even if the master thinks it's not), but my point is that without a clear list of standby states, we're going to hinder the master in situations where it makes no sense to do so.

> Or do you envision any use case that requires a quorum of X standbies
> for normal operation but is just fine with only none to (X-1) standbies
> in case of failures? IMO that's when sync replication is most needed and
> when it absolutely should hold to its promises - even if it means to
> stop the system.
>
> There's no point in continuing operation if you cannot guarantee the
> minimum requirements for durability. If you happen to want such a thing,
> you should better rethink your minimum requirement (as performance for
> normal operations might benefit from a lower minimum as well).

This part of the discussion made me think of yet another refinement on the Quorum Commit idea, even if I'm beginning to think that can be material for later. Basic Quorum Commit is having each transaction on the master wait for a total number of votes to accept the transaction synced. Each standby has a weight, meaning 1 or more votes. The problem is the flexibility isn't there; some cases are impossible to set up. Also, people want to be able to specify their favorite standby, and that quickly gets unwieldy.

Idea: segment the votes into "colors" or any categories you like. Have each standby be a member of a category list, and require per-category quorums to be reached. This is the same as attributing roles to standbys and saying that they're all equivalent as soon as they're part of the given role, with the added flexibility that you can sometimes want more than one standby of a given role to take part in the quorum.

Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
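Dimitri's "colors" refinement amounts to a per-category quorum check: a commit is released only when every category has collected its own required number of acks. The following is an illustrative Python sketch; the category names, standby ids, and function signature are all invented for the example.

```python
# Sketch of per-category ("colors") quorum commit: each standby belongs
# to one or more categories, and every category must reach its own
# quorum before the commit is released. All names are invented.

def quorums_met(acks, standby_categories, required):
    """acks: set of standby ids that acknowledged.
    standby_categories: standby id -> set of category names.
    required: category name -> number of acks needed in that category."""
    counts = {cat: 0 for cat in required}
    for standby in acks:
        for cat in standby_categories.get(standby, ()):
            if cat in counts:
                counts[cat] += 1
    return all(counts[cat] >= n for cat, n in required.items())

standbys = {
    "local1": {"local"}, "local2": {"local"},
    "dr1": {"offsite"}, "dr2": {"offsite"},
}
# e.g. require 1 ack from a local standby AND 1 from an offsite one
rule = {"local": 1, "offsite": 1}

assert quorums_met({"local1", "dr2"}, standbys, rule) is True
assert quorums_met({"local1", "local2"}, standbys, rule) is False
```

Plain quorum commit is the special case of a single category containing every standby, which is the sense in which this is a generalization rather than a different mechanism.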
On 06.10.2010 15:22, Dimitri Fontaine wrote: > What is necessary here is a clear view on the possible states that a > standby can be in at any time, and we must stop trying to apply to > some non-ready standby the behavior we want when it's already in-sync. > > From my experience operating londiste, those states would be: > > 1. base-backup — self explaining > 2. catch-up — getting the WAL to catch up after base backup > 3. wanna-sync — don't yet have all the WAL to get in sync > 4. do-sync — all WALs are there, coming soon > 5. ok (async | recv | fsync | reply — feedback loop engaged) > > So you only consider that a standby is a candidate for sync rep when > it's reached the ok state, and that's when it's able to fill the > feedback loop we've been talking about. Standby state != ok, no waiting > no nothing, it's *not* a standby as far as the master is concerned. You're not going to get zero data loss that way. Can you elaborate what the use case for that mode is? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2010-10-06 at 15:26 +0300, Heikki Linnakangas wrote:
> You're not going to get zero data loss that way.

Ending the wait state does not cause data loss. It puts you at *risk* of data loss, which is a different thing entirely.

If you want to avoid data loss you use N+k redundancy and get on with life, rather than sitting around waiting. Putting in a feature for people that choose k=0 seems wasteful to me, since they knowingly put themselves at risk in the first place.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> 1. base-backup — self explaining >> 2. catch-up — getting the WAL to catch up after base backup >> 3. wanna-sync — don't yet have all the WAL to get in sync >> 4. do-sync — all WALs are there, coming soon >> 5. ok (async | recv | fsync | reply — feedback loop engaged) >> >> So you only consider that a standby is a candidate for sync rep when >> it's reached the ok state, and that's when it's able to fill the >> feedback loop we've been talking about. Standby state != ok, no waiting >> no nothing, it's *not* a standby as far as the master is concerned. > > You're not going to get zero data loss that way. Can you elaborate what the > use case for that mode is? You can't pretend to sync with zero data loss until the standby is ready for it, or you need to take the site down while you add your standby. I can see some user willing to take the site down while doing the base backup dance then waiting for initial sync, then only accepting traffic and being secure against data loss, but I'd much rather that be an option and you could watch for your standby's state in a system view. Meanwhile, I can't understand any reason for the master to pretend it can safely manage any sync-rep transaction while there's no standby around. Either you wait for the quorum and don't have it, or you have to track standby states with precision and maybe actively reject writes. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 06.10.2010 17:20, Simon Riggs wrote: > On Wed, 2010-10-06 at 15:26 +0300, Heikki Linnakangas wrote: > >> You're not going to get zero data loss that way. > > Ending the wait state does not cause data loss. It puts you at *risk* of > data loss, which is a different thing entirely. Looking at it that way, asynchronous replication just puts you at risk of data loss too, it doesn't necessarily mean you get data loss. The key is whether you are guaranteed to have zero data loss or not. If you don't wait forever, you're not guaranteed zero data loss. It's just best effort, like asynchronous replication. The situation you want to avoid is that the master dies, and you don't know if you have suffered data loss or not. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 06.10.2010 18:02, Dimitri Fontaine wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>> 1. base-backup — self explaining
>>> 2. catch-up — getting the WAL to catch up after base backup
>>> 3. wanna-sync — don't yet have all the WAL to get in sync
>>> 4. do-sync — all WALs are there, coming soon
>>> 5. ok (async | recv | fsync | reply — feedback loop engaged)
>>>
>>> So you only consider that a standby is a candidate for sync rep when
>>> it's reached the ok state, and that's when it's able to fill the
>>> feedback loop we've been talking about. Standby state != ok, no waiting
>>> no nothing, it's *not* a standby as far as the master is concerned.
>>
>> You're not going to get zero data loss that way. Can you elaborate what the
>> use case for that mode is?
>
> You can't pretend to sync with zero data loss until the standby is ready
> for it, or you need to take the site down while you add your standby.
>
> I can see some user willing to take the site down while doing the base
> backup dance then waiting for initial sync, then only accepting traffic
> and being secure against data loss, but I'd much rather that be an
> option and you could watch for your standby's state in a system view.
>
> Meanwhile, I can't understand any reason for the master to pretend it
> can safely manage any sync-rep transaction while there's no standby
> around. Either you wait for the quorum and don't have it, or you have to
> track standby states with precision and maybe actively reject writes.

I'm sorry, but I still don't understand the use case you're envisioning. How many standbys are there? What are you trying to achieve with synchronous replication over what asynchronous offers?

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 10/06/2010 04:20 PM, Simon Riggs wrote:
> Ending the wait state does not cause data loss. It puts you at *risk* of
> data loss, which is a different thing entirely.

These kinds of risk scenarios are what sync replication is all about. A minimum guarantee that doesn't hold in the face of the first few failures (see Jeff's argument) isn't worth a dime. Keep in mind that upon failure, the other nodes presumably get more load. As has been seen with RAID, that easily leads to subsequent failures. Sync rep needs to be able to protect against that *as well*.

> If you want to avoid data loss you use N+k redundancy and get on with
> life, rather than sitting around waiting.

With that notion, I'd argue that quorum_commit needs to be set to exactly k, because any higher value would only cost performance without any useful benefit. But if I want at least k ACKs and if I think it's worth the performance penalty that brings during normal operation, I want that guarantee to hold true *especially* in case of an emergency.

If availability is more important, you need to increase N and make sure enough of these (asynchronously) replicated nodes stay up. Increase k (thus quorum commit) for a stronger durability guarantee.

> Putting in a feature for people that choose k=0 seems wasteful to me,
> since they knowingly put themselves at risk in the first place.

Given the above logic, k=0 equals completely async replication. Not sure what's wrong about that.

Regards

Markus Wanner
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I'm sorry, but I still don't understand the use case you're envisioning. How
> many standbys are there? What are you trying to achieve with synchronous
> replication over what asynchronous offers?

Sorry if I've been unclear, I read loads of messages then tried to pick the right one to answer, and obviously failed to spell out some context. My concern starts with only 1 standby, and is in fact 2 questions:

- Why oh why wouldn't you be able to fix your sync setup in the master as soon as there's a standby doing a base backup?
- When do you start considering the standby as a candidate to your sync rep requirements?

Lots of the discussions we're having take as an implicit that the answer is "as soon as you know about its existence, that must be at the pg_start_backup() point". I claim that's incorrect, and you can't ask the master to wait forever until the standby is in sync. All the more because there's a window with wal_keep_segments here too, so the sync might never happen.

To solve that problem, I propose managing the current state of the standby. That means auto registration of any standby, a feedback loop at more stages, and some protocol arbitrage for the standby to be able to say "I'm this far actually" so that the master can know how to consider it, rather than just demote it while live.

Once you have a clear list of possible states for a standby, and can decide on what errors mean in terms of transitions in the state machine, you're able to decide when wait forever is an option and when you should ignore it or refuse any side-effect transaction commit. And you can offer an option to guarantee the wait-forever behavior only when it makes sense, rather than trying to catch your own tail as soon as a standby is added in the mix, with the proposals I've read on how you can't even restart the master at this point.
Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
All, Let me clarify and consolidate this discussion. Again, it's my goal that this thread specifically identify only the problems and desired behaviors for synch rep with more than one sync standby. There are several issues with even one sync standby which still remain unresolved, but I believe that we should discuss those on a separate thread, for clarity. I also strongly believe that we should get single-standby functionality committed and tested *first*, before working further on multi-standby. So, to summarize earlier discussion on this thread: There are 2 reasons to have more than one sync standby: 1) To increase durability above the level of a single synch standby, even at the cost of availability. 2) To increase availability without decreasing durability below the level offered by a single sync standby. The "pure" setup for each of these options, where N is the number of standbys and k is the number of acks required from standbys, is: 1) k = N, N > 1, apply 2) k = 1, N > 1, recv (Timeouts are a specific compromise of durability for availability on *one* server, and as such will not be discussed here. BTW, I was the one who suggested a timeout, rather than Simon, so if you don't like the idea, harass me about it.) Any configuration (3) other than the two above is a specific compromise between durability and availability, for example: 3a) k = 2, N = 3, fsync 3b) k = 3, N = 10, recv ... should give you better durability than case 2) and better availability than case 1). While it's nice to dismiss case (1) as an edge-case, consider the likelihood of someone running PostgreSQL with fsync=off on cloud hosting. In that case, having k = N = 5 does not seem like an unreasonable arrangement if you want to ensure durability via replication. It's what the CAP databases do. After eliminating some of my issues as non-issues, here's what we're left with for problems on the above: (1), (3) Accounting/Registration. 
Implementing any of these cases would seem to require some form of accounting and/or registration on the master in terms of, at a minimum, the number of acks for each data send. More likely we will need, as proposed on other threads, a register of standbys and the sync state of each. Not only will this accounting/registration be hard code to write, it will have at least *some* performance overhead. Whether that overhead is minor or substantial can only be determined through testing. Further, there's the issue of whether, and how, we transmit this register to the standbys so that they can be promoted. (2), (3) Degradation: (Jeff) these two cases make sense only if we give DBAs the tools they need to monitor which standbys are falling behind, and to drop and replace those standbys. Otherwise we risk giving DBAs false confidence that they have better-than-1-standby reliability when actually they don't. Current tools are not really adequate for this. (1), (3) Dynamic Re-configuration: we need the ability to add and remove standbys at runtime. We also need to have a verdict on how to handle the case where a transaction is pending, per Heikki. (2), (3) Promotion: all multi-standby high-availability cases only make sense if we provide tools to promote the most current standby to be the new master. Otherwise the whole cluster still goes down whenever we have to replace the master. We also should provide some mechanism for promoting an async standby to sync; this has already been discussed. (1) Consistency: this is another DBA-false-confidence issue. DBAs who implement (1) are liable to do so thinking that they are not only guaranteeing the consistency of every standby with the master, but the consistency of every standby with every other standby -- a kind of dummy multi-master. They are not, so it will take multiple reminders and workarounds in the docs to explain this. And we'll get complaints anyway. 
(1), (2), (3) Initialization: (Dimitri) we need a process whereby a standby can go from cloned to synched to being a sync rep standby, and possibly from degraded to synced again and back. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
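The accounting/registration issue Josh raises is easy to sketch in miniature. The following toy model shows what the master would minimally have to track per pending commit; all names (QuorumTracker, record_ack) are invented for illustration and make no claim about how an actual patch would implement this inside the backend.

```python
class QuorumTracker:
    """Tracks acknowledgements for commits still waiting on their quorum."""

    def __init__(self, quorum_k):
        self.quorum_k = quorum_k
        self.pending = {}  # commit LSN -> set of standby names that acked

    def begin_commit(self, lsn):
        """Register a commit that will wait for quorum_k acks."""
        self.pending[lsn] = set()

    def record_ack(self, lsn, standby):
        """Return True once this commit has collected its quorum."""
        acks = self.pending.get(lsn)
        if acks is None:
            return True  # commit already released; late ack is a no-op
        acks.add(standby)
        if len(acks) >= self.quorum_k:
            del self.pending[lsn]  # release the waiting backend
            return True
        return False

tracker = QuorumTracker(quorum_k=2)
tracker.begin_commit(lsn=1000)
assert not tracker.record_ack(1000, "standby_a")   # 1 ack, still waiting
assert tracker.record_ack(1000, "standby_b")       # quorum reached
```

Even this toy version hints at the overhead Josh mentions: one dict entry and one set per in-flight commit, touched once per standby ack, under concurrent access in the real thing.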
Hello Dimitri, On 10/06/2010 05:41 PM, Dimitri Fontaine wrote: > - when do you start considering the standby as a candidate for your sync > rep requirements? That question doesn't make much sense to me. There's no point in time at which I ever mind whether a standby is a "candidate" or not. Either I want to synchronously replicate to X standbys, or not. > Much of the discussion we're having takes it as implicit that the > answer is "as soon as you know about its existence, which must be at the > pg_start_backup() point". This is an admin decision. Whether your standbys are up and running, existing or just about to be bought, that doesn't have any impact on your durability requirements. If you want your banking account data to be saved in at least two different locations, I think that's your requirement. You'd be quite unhappy if your bank lost your last month's salary, but stated: "hey, at least we didn't have any downtime". > And you can offer an option to guarantee the wait-forever behavior only > when it makes sense, rather than trying to catch your own tail as soon > as a standby is added to the mix Of course, it doesn't make sense to wait forever on *every* standby that ever gets added. Quorum commit is required, yes (and that's what this thread is about, IIRC). But with quorum commit, adding a standby only improves availability, and certainly doesn't block the master in any way. (Quite the opposite: it can allow the master to continue, if it had been blocked before because the quorum hadn't been reached.) Regards Markus Wanner
Markus Wanner <markus@bluegap.ch> writes: > There's no point in time at which I > ever mind whether a standby is a "candidate" or not. Either I want to > synchronously replicate to X standbys, or not. OK, so I think we're agreeing here: what I said amounts to proposing that the code work this way when the quorum is set up that way, and/or is able to reject any non-read-only transaction (those that need a real XID) until your standby is fully in sync. I'm just saying that this should be an option, not the only choice. And that by having a clear view of the system's state, it's possible to set out a clear error response policy. > This is an admin decision. Whether your standbys are up and > running, existing or just about to be bought, that doesn't have > any impact on your durability requirements. Depends; lots of things out there work quite well in best-effort mode, even if some projects need more careful thinking. That's again the idea of waiting forever or just continuing: there's a middle ground, which is starting the system before reaching the durability requirements, or downgrading it to read only, or even off, until you get them. You can read my proposal as a way to allow our users to choose between those two incompatible behaviours. > Of course, it doesn't make sense to wait forever on *every* standby that > ever gets added. Quorum commit is required, yes (and that's what this > thread is about, IIRC). But with quorum commit, adding a standby only > improves availability, and certainly doesn't block the master in any > way. (Quite the opposite: it can allow the master to continue, if it had > been blocked before because the quorum hadn't been reached.) If you ask for a quorum larger than what the current standbys are able to deliver, and you're set to wait forever until the quorum is reached, you just blocked the master. 
Good news is that the quorum is a per-transaction setting, so opening a superuser connection to act on the currently waiting transaction is still possible (pass/fail, but what does fail mean at this point? Shut down to wait some more offline?). Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
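Dimitri's superuser-override idea can be sketched in miniature: a commit normally waits for its quorum, but an administrator can manually release it. The class and method names below are invented for illustration; a real implementation would live inside the backend's commit path, not in application code.

```python
import threading

class WaitingCommit:
    """A commit blocked until quorum_k acks arrive, or an admin overrides."""

    def __init__(self, quorum_k):
        self.quorum_k = quorum_k
        self.acks = 0
        self.released = threading.Event()
        self.release_reason = None

    def ack(self):
        """Called once per standby acknowledgement."""
        self.acks += 1
        if self.acks >= self.quorum_k and not self.released.is_set():
            self.release_reason = "quorum"
            self.released.set()

    def admin_pass(self):
        """Superuser decides the quorum will never arrive; let the app go."""
        if not self.released.is_set():
            self.release_reason = "admin override"
            self.released.set()

c = WaitingCommit(quorum_k=3)
c.ack()            # only one standby answered; still blocked
c.admin_pass()     # admin unblocks the waiting transaction
assert c.released.is_set() and c.release_reason == "admin override"
```

The "fail" branch Dimitri asks about is exactly the open question: in this sketch there is only "pass", because it's unclear what cancelling an already-locally-committed transaction could even mean.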
On 06.10.2010 20:57, Josh Berkus wrote: > While it's nice to dismiss case (1) as an edge-case, consider the > likelyhood of someone running PostgreSQL with fsync=off on cloud > hosting. In that case, having k = N = 5 does not seem like an > unreasonable arrangement if you want to ensure durability via > replication. It's what the CAP databases do. Seems reasonable, but what is a CAP database? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 10/06/2010 09:04 PM, Dimitri Fontaine wrote: > OK, so I think we're agreeing here: what I said amounts to proposing that > the code work this way when the quorum is set up that way, and/or is > able to reject any non-read-only transaction (those that need a real > XID) until your standby is fully in sync. > > I'm just saying that this should be an option, not the only choice. I'm sorry, I just don't see the use case for a mode that drops guarantees when they are most needed. People who don't need those guarantees should definitely go for async replication instead. What does a synchronous replication mode that falls back to async upon failure give you, except for a severe degradation in performance during normal operation? Why not use async right away in such a case? > Depends; lots of things out there work quite well in best-effort mode, > even if some projects need more careful thinking. That's again the idea > of waiting forever or just continuing: there's a middle ground, which is > starting the system before reaching the durability requirements, or > downgrading it to read only, or even off, until you get them. In such cases the admin should be free to reconfigure the quorum. And yes, a read-only mode might be feasible. Please just don't fool the admin with a "best effort" thing that guarantees nothing (but trouble). > If you ask for a quorum larger than what the current standbys are able > to deliver, and you're set to wait forever until the quorum is reached, > you just blocked the master. Correct. That's the intended behavior. > Good news is that the quorum is a per-transaction setting I definitely like the per-transaction thing. > so opening a > superuser connection to act on the currently waiting transaction is > still possible (pass/fail, but what does fail mean at this point? Shut down to > wait some more offline?). Not sure I'm following here. 
The admin will be busy re-establishing (connections to) standbys; killing transactions on the master doesn't help anything, whether or not the master waits forever. Regards Markus Wanner
> Seems reasonable, but what is a CAP database? Databases designed around the CAP theorem [1]: Cassandra, Dynamo, Hypertable, etc. For us, the equation is CAD, as in Consistency, Availability, Durability. Pick any two, at best. But it's a very similar bag of issues to the ones CAP addresses. [1] http://www.julianbrowne.com/article/viewer/brewers-cap-theorem -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Wed, 2010-10-06 at 18:04 +0300, Heikki Linnakangas wrote: > The key is whether you are guaranteed to have zero data loss or not. We agree that is an important question. You seem willing to trade anything for that guarantee. I seek a more pragmatic approach that balances availability and risk. Those views are different, but not inconsistent. Oracle manages to offer multiple options and so can we. If you desire that, go for it. But don't try to stop others having a simple, pragmatic approach. The code to implement your desired option is more complex and really should come later. I don't in any way wish to block that option in this release, or any other, but please don't try to persuade people it's the only sensible option 'cos it damn well isn't. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Wed, 2010-10-06 at 10:57 -0700, Josh Berkus wrote: > I also strongly believe that we should get single-standby > functionality committed and tested *first*, before working further on > multi-standby. Yes, let's get k = 1 first. With k = 1 the number of standbys is not limited, so we can still have very robust and highly available architectures. So we mean "first-acknowledgement-releases-waiters". > (1) Consistency: this is another DBA-false-confidence issue. DBAs who > implement (1) are liable to do so thinking that they are not only > guaranteeing the consistency of every standby with the master, but the > consistency of every standby with every other standby -- a kind of > dummy multi-master. They are not, so it will take multiple reminders > and workarounds in the docs to explain this. And we'll get complaints > anyway. This puts the matter very clearly. Setting k = N is not as good an idea as it sounds when first described. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Wed, 2010-10-06 at 10:57 -0700, Josh Berkus wrote: > (2), (3) Degradation: (Jeff) these two cases make sense only if we > give > DBAs the tools they need to monitor which standbys are falling behind, > and to drop and replace those standbys. Otherwise we risk giving DBAs > false confidence that they have better-than-1-standby reliability when > actually they don't. Current tools are not really adequate for this. Current tools work just fine for identifying if a server is falling behind. This improved in 9.0 to give fine-grained information. Nothing more is needed here within the server. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 10/06/2010 10:01 PM, Simon Riggs wrote: > The code to implement your desired option is > more complex and really should come later. I'm sorry, but I think of that exactly the opposite way. The timeout for automatic continuation after waiting for a standby is the addition. The wait state of the master is there anyway, whether or not it's bound by a timeout. The timeout option should thus come later. Regards Markus Wanner
Markus Wanner <markus@bluegap.ch> writes: >> I'm just saying that this should be an option, not the only choice. > > I'm sorry, I just don't see the use case for a mode that drops > guarantees when they are most needed. People who don't need those > guarantees should definitely go for async replication instead. We're still talking about freezing the master and all the applications while the first standby still has to do a base backup and catch up to where the master currently is, right? > What does a synchronous replication mode that falls back to async upon > failure give you, except for a severe degradation in performance during > normal operation? Why not use async right away in such a case? It's all about the standard case you're building, sync rep, and how to manage errors. In most cases I want flexibility. An alert says the standby is down, you've lost your durability requirements, so now I'm building a new standby. Does it mean my applications are all off and the master is refusing to work? I sure hope I can choose about that, if possible per application. Next step: the old standby has been able to boot again, thanks to the sysadmins who repaired it, so it's online again, and my replacement machine is doing a base backup. Are all the applications still unavailable? I sure hope I have a say in this decision. >> so opening a >> superuser connection to act on the currently waiting transaction is >> still possible (pass/fail, but what does fail mean at this point? Shut down to >> wait some more offline?). > > Not sure I'm following here. The admin will be busy re-establishing > (connections to) standbys; killing transactions on the master doesn't > help anything, whether or not the master waits forever. The idea here would be to be able to manually ACK a transaction that's waiting forever, because you know it won't get an answer and you'd prefer the application to just continue. But I see that's not a valid use case for you. 
Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 07.10.2010 12:52, Dimitri Fontaine wrote: > Markus Wanner<markus@bluegap.ch> writes: >>> I'm just saying that this should be an option, not the only choice. >> >> I'm sorry, I just don't see the use case for a mode that drops >> guarantees when they are most needed. People who don't need those >> guarantees should definitely go for async replication instead. > > We're still talking about freezing the master and all the applications > when the first standby still has to do a base backup and catch-up to > where the master currently is, right? Either that, or you configure your system for asynchronous replication first, and flip the switch to synchronous only after the standby has caught up. Setting up the first standby happens only once when you initially set up the system, or if you're recovering from a catastrophic loss of the standby. >> What does a synchronous replication mode that falls back to async upon >> failure give you, except for a severe degradation in performance during >> normal operation? Why not use async right away in such a case? > > It's all about the standard case you're building, sync rep, and how to > manage errors. In most cases I want flexibility. Alert says standby is > down, you lost your durability requirements, so now I'm building a new > standby. Does it mean my applications are all off and the master > refusing to work? Yes. That's why you want to have at least two standbys if you care about availability. Or if durability isn't that important to you after all, use asynchronous replication. Of course, if in the heat of the moment the admin is willing to forge ahead without the standby, he can temporarily change the configuration in the master. If you want the standby to be rebuilt automatically, you can even incorporate that configuration change in the scripts too. The important point is that you or your scripts are in control, and you know at all times whether you can trust the standby or not. 
If the master makes such decisions automatically, you don't know if the standby is trustworthy (ie. guaranteed up-to-date) or not. >>> so opening a >>> superuser connection to act on the currently waiting transaction is >>> still possible (pass/fail, but fail is what at this point? shutdown to >>> wait some more offline?). >> >> Not sure I'm following here. The admin will be busy re-establishing >> (connections to) standbies, killing transactions on the master doesn't >> help anything - whether or not the master waits forever. > > The idea here would be to be able to manually ACK a transaction that's > waiting forever, because you know it won't have an answer and you'd > prefer the application to just continue. But I see that's not a valid > use case for you. I don't see anything wrong with having tools for admins to deal with the unexpected. I'm not sure overriding individual transactions is very useful though, more likely you'll want to take the whole server offline, or you want to change the config to allow all transactions to continue without the synchronous standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Either that, or you configure your system for asynchronous replication > first, and flip the switch to synchronous only after the standby has caught > up. Setting up the first standby happens only once when you initially set up > the system, or if you're recovering from a catastrophic loss of the > standby. Or if the standby is lagging and the master wal_keep_segments is not sized big enough. Is that a catastrophic loss of the standby too? >> It's all about the standard case you're building, sync rep, and how to >> manage errors. In most cases I want flexibility. Alert says standby is >> down, you lost your durability requirements, so now I'm building a new >> standby. Does it mean my applications are all off and the master >> refusing to work? > > Yes. That's why you want to have at least two standbys if you care about > availability. Or if durability isn't that important to you after all, use > asynchronous replication. Agreed, that's a nice simple use case. Another one is to say that I want sync rep when the standby is available, but I don't have the budget for more. So I prefer a good alerting system and low-budget-no-guarantee when the standby is down, that's my risk evaluation. > Of course, if in the heat of the moment the admin is willing to forge ahead > without the standby, he can temporarily change the configuration in the > master. If you want the standby to be rebuilt automatically, you can even > incorporate that configuration change in the scripts too. The important > point is that you or your scripts are in control, and you know at all times > whether you can trust the standby or not. If the master makes such decisions > automatically, you don't know if the standby is trustworthy (ie. guaranteed > up-to-date) or not. My proposal is that the master has the information to make the decision, and the behavior is something you setup. 
Default to safety, so wait forever and block the applications, but it could be set to ignore standbys that have not at least reached a given state. I don't see that you can make everybody happy without a knob here, and I don't see how we can deliver one without a clear state diagram of the standby's possible current states and transitions. The other alternative is to just not care and accept the timeout as an option with the quorum, so that you simply don't wait for the quorum if you so choose. It's much more dynamic and dangerous, but with a good alerting system it'll be very popular, I guess. > I don't see anything wrong with having tools for admins to deal with the > unexpected. I'm not sure overriding individual transactions is very useful > though, more likely you'll want to take the whole server offline, or you > want to change the config to allow all transactions to continue without the > synchronous standby. The question then is, should the new configuration alter running transactions? My implicit assumption was that it shouldn't, and then I need another facility, such as SELECT pg_cancel_quorum_wait(procpid) FROM pg_stat_activity WHERE waiting_quorum; Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
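The state diagram Dimitri keeps asking for might, in miniature, look like the sketch below. The state names, the allowed transitions, and the rule for which states count toward the quorum are all invented for illustration; they are not taken from any proposal on the list.

```python
# Hypothetical standby lifecycle: which transitions are legal, and which
# states make the standby eligible to satisfy the sync quorum.
ALLOWED = {
    "base_backup": {"catchup"},
    "catchup":     {"sync", "base_backup"},
    "sync":        {"degraded"},
    "degraded":    {"catchup"},   # a repaired standby must catch up again
}
COUNTS_FOR_QUORUM = {"sync"}

class Standby:
    def __init__(self):
        self.state = "base_backup"

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def counts_for_quorum(self):
        return self.state in COUNTS_FOR_QUORUM

s = Standby()
assert not s.counts_for_quorum()     # a joining standby is async for now
s.transition("catchup")
s.transition("sync")
assert s.counts_for_quorum()
s.transition("degraded")
assert not s.counts_for_quorum()     # master must not wait on it blindly
```

With an explicit table like this, "when is wait-forever an option" becomes a per-state policy question rather than a single global switch.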
On Thu, 2010-10-07 at 11:46 +0200, Markus Wanner wrote: > On 10/06/2010 10:01 PM, Simon Riggs wrote: > > The code to implement your desired option is > > more complex and really should come later. > > I'm sorry, but I think of that exactly the opposite way. I see why you say that. Dimitri's suggestion is an enhancement on the basic feature, just as Heikki's is. My reply was directed at Heikki, but should apply to Dimitri's idea also. > The timeout for > automatic continuation after waiting for a standby is the addition. The > wait state of the master is there anyway, whether or not it's bound by a > timeout. The timeout option should thus come later. Adding timeout is very little code. We can take that out of the patch if that's an objection. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 10/07/2010 01:08 PM, Simon Riggs wrote: > Adding timeout is very little code. We can take that out of the patch if > that's an objection. Okay. If you take it out, we are at the wait-forever option, right? If not, I definitely don't understand how you envision things to happen. I've been asking [1] about that distinction before, but didn't get a direct answer. Regards Markus Wanner [1]: Re: Configuring synchronous replication, Markus Wanner: http://archives.postgresql.org/message-id/4C9C5887.4040901@bluegap.ch
On Thu, Oct 7, 2010 at 3:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Yes, lets get k = 1 first. > > With k = 1 the number of standbys is not limited, so we can still have > very robust and highly available architectures. So we mean > "first-acknowledgement-releases-waiters". +1. I like the design Greg Smith proposed yesterday (though there are details to be worked out). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Salut Dimitri, On 10/07/2010 12:32 PM, Dimitri Fontaine wrote: > Another one is to say that I want sync rep when the standby is > available, but I don't have the budget for more. So I prefer a good > alerting system and low-budget-no-guarantee when the standby is down, > that's my risk evaluation. I think that's a pretty special case, because the "good alerting system" is at least as expensive as another server that just persistently stores and ACKs incoming WAL. Why does one ever want the guarantee that sync replication gives to only hold true up to one failure, if a better guarantee doesn't cost anything extra? (Note that a "good alerting system" is impossible to achieve with only two servers. You need a third device anyway). Or put another way: a "good alerting system" is one that understands Postgres to some extent. It protects you from data loss in *every* case. If you attach at least two database servers to it, you get availability as long as any one of the two is up and running. No matter what happened before, even a full cluster power outage is guaranteed to recover from automatically without any data loss. [ Okay, the standby mode that only stores and ACKs WAL, without having a full database behind it, still needs to be written. However, pg_streamrecv certainly goes in that direction already; see [1]. ] Sync replication between really just two servers is asking for trouble and certainly not worth the savings in hardware cost. Better invest in a good UPS and redundant power supplies for a single server. > The question then is, should the new configuration alter running > transactions? It should definitely affect all currently running and waiting transactions. For anything beyond three servers, where quorum_commit could be bigger than one, it absolutely makes sense to be able to just lower the requirements temporarily, instead of having to cancel the guarantee completely. 
Regards Markus Wanner [1]: Using streaming replication as log archiving, Magnus Hagander http://archives.postgresql.org/message-id/AANLkTi=_BzsYT8a1KjtpWZxNWyYgqNVp1NbJWRnsD_Nv@mail.gmail.com
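The store-and-ACK standby Markus describes reduces to one loop: persist each incoming WAL chunk durably, then acknowledge it. Below is a toy sketch with an invented connection interface; nothing here reflects pg_streamrecv's actual code or the streaming protocol's wire format.

```python
import os
import tempfile

def receive_loop(conn, wal_path):
    """Persist each WAL chunk durably before acknowledging it."""
    with open(wal_path, "ab") as f:
        for chunk in conn.stream():
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())   # on disk before we ack: the whole point
            conn.ack(len(chunk))

class FakeConn:
    """Stand-in for a streaming connection; yields chunks, counts acks."""
    def __init__(self, chunks):
        self.chunks, self.acked = chunks, 0
    def stream(self):
        yield from self.chunks
    def ack(self, nbytes):
        self.acked += nbytes

conn = FakeConn([b"wal-chunk-1", b"wal-chunk-2"])
wal_path = os.path.join(tempfile.mkdtemp(), "standby.wal")
receive_loop(conn, wal_path)
assert conn.acked == 22   # both chunks durably stored and acknowledged
```

Such a receiver gives the durability half of a sync standby (the ack only after fsync) without the cost of running a full hot standby database behind it.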
Markus Wanner <markus@bluegap.ch> writes: > Why does one ever want the guarantee that sync replication gives to only > hold true up to one failure, if a better guarantee doesn't cost anything > extra? (Note that a "good alerting system" is impossible to achieve with > only two servers. You need a third device anyway). I think you're all into durability, and that's good. If that's not all you're after, the extra cost is service downtime: there's also availability, and load-balancing read queries on a system with no lag (no serving of stale data) when all is working right. I still think your use case is a solid one, but that we need to be ready to answer some other ones, the ones you call relaxed and wrong because of data-loss risks. My proposal is to make the risk window obvious, and the behavior when you enter it configurable. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Oct 7, 2010 at 6:32 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Or if the standby is lagging and the master wal_keep_segments is not > sized big enough. Is that a catastrophic loss of the standby too? Sure, but that lagged standby is already asynchronous, not synchronous. If it was synchronous, it would have slowed the master down enough that it would not be lagged. I'm really confused by all these k < N scenarios I see bandied about, because all they really amount to is "I only want *one* synchronous replica, and a bunch of asynchronous replicas". And a bit of chance thrown into the mix, hoping the "synchronous" one is pretty stable and the asynchronous ones aren't *too* far behind (define "too" and "far" at your leisure). And then I see a lot of posturing about how to "recover" when the "asynchronous standbys" aren't "synchronous enough" at some point... > > Agreed, that's a nice simple use case. > > Another one is to say that I want sync rep when the standby is > available, but I don't have the budget for more. So I prefer a good > alerting system and low-budget-no-guarantee when the standby is down, > that's my risk evaluation. That screams wrong in my book: "OK, I want durability, so I always want to have 2 copies of the data, but if we lose one copy, I want to keep on trucking, because I don't *really* want durability". If you want most-of-the-time, mostly-2-copy durability, then really good asynchronous replication is a really good solution. Yes, I believe you need to have a way for an admin (or process/control/config) to be able to "demote" a synchronous replication scenario into async (or "standalone", which is just an extension of really async). But it's no longer synchronous replication at that point. And if the choice is made to "keep trucking" while a new standby is being brought online, made available and caught up, that's fine too. 
But during that period, until the slave is caught up and synchronously replicating, it's *not* synchronous replication. So I'm not arguing that there shouldn't be a way to turn off synchronous replication once it's on, hopefully without having to take down the cluster (the pg-instance type of cluster). But I am pleading that there be a way to set up PG such that synchronous replication *is* synchronously replicating, or things stop and back up until such time as it is. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Aidan Van Dyk <aidan@highrise.ca> writes: > Sure, but that lagged standby is already asynchronous, not > synchronous. If it was synchronous, it would have slowed the master > down enough that it would not be lagged. Agreed, except in the case of a joining standby. But you're saying it better than I do: > Yes, I believe you need to have a way for an admin (or > process/control/config) to be able to "demote" a synchronous > replication scenario into async (or "standalone", which is just an > extension of really async). But it's no longer synchronous replication > at that point. And if the choice is made to "keep trucking" while a > new standby is being brought online and available and caught up, > that's fine too. But during that period, until the slave is caught > up and synchronously replicating, it's *not* synchronous replication. That's exactly my point. I think we need to handle the case and make it obvious that this window is a data-loss window where there's no sync rep ongoing, then offer users a choice of behaviour. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Oct 7, 2010 at 10:08 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Aidan Van Dyk <aidan@highrise.ca> writes: >> Sure, but that lagged standby is already asynchronous, not >> synchronous. If it was synchronous, it would have slowed the master >> down enough that it would not be lagged. > > Agreed, except in the case of a joining standby. *shrug* The joining standby is still asynchronous at this point. It's not synchronous replication. It's just another of the N-k slaves serving stale data ;-) > But you're saying it > better than I do: > >> Yes, I believe you need to have a way for an admin (or >> process/control/config) to be able to "demote" a synchronous >> replication scenario into async (or "standalone", which is just an >> extension of really async). But it's no longer synchronous replication >> at that point. And if the choice is made to "keep trucking" while a >> new standby is being brought online and available and caught up, >> that's fine too. But during that period, until the slave is caught >> up and synchronously replicating, it's *not* synchronous replication. > > That's exactly my point. I think we need to handle the case and make it > obvious that this window is a data-loss window where there's no sync rep > ongoing, then offer users a choice of behaviour. Again, I'm stating there is *no* choice in synchronous replication. It's *got* to block; otherwise it's not synchronous replication. The "choice" is whether you want synchronous replication or not at that point. And turning it off might be a good (best) choice for most people. I just want to make sure that: 1) There's no way to *sensibly* think it's still "synchronously replicating" 2) There is a way to enforce that the commits happening *are* synchronously replicated. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Aidan Van Dyk <aidan@highrise.ca> writes:
> *shrug* The joining standby is still asynchronous at this point.
> It's not synchronous replication. It's just another of the N
> slaves serving stale data ;-)

Agreed *here*, but if you read the threads again, you'll see that's not
at all what's been talked about before my proposal. In particular, the
questions about how to unlock a master's setup while its synced standby
is doing a base backup should not be allowed to exist, and you seem to
agree with my point.

>> That's exactly my point. I think we need to handle the case and make it
>> obvious that this window is a data-loss window where there's no sync rep
>> ongoing, then offer users a choice of behaviour.
>
> Again, I'm stating there is *no* choice in synchronous replication.
> It's *got* to block, otherwise it's not synchronous replication. The
> "choice" is whether you want synchronous replication or not at that point.

Exactly, even if I didn't dare spell it this way. What I want to propose
is for the user to be able to configure things so that he loses the sync
aspect of the replication if it so happens that the setup is not able to
provide for it. It may sound strange, but it's needed when all you want
is, e.g., a reporting standby with no stale data. And it so happens that
it's already in Simon's code, AFAIUI (I have yet to read it).

> And turning it off might be a good (best) choice for most people.
> I just want to make sure that:
> 1) There's no way to *sensibly* think it's still "synchronously replicating"
> 2) There is a way to enforce that the commits happening *are*
> synchronously replicating.

We're on the same track. I don't know how to offer your options without
a clear listing of standby states and transitions, which must include
the synchronicity and whether you just lost it or whatnot.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Markus Wanner wrote: > I think that's a pretty special case, because the "good alerting system" > is at least as expensive as another server that just persistently stores > and ACKs incoming WAL. > The cost of hardware capable of running a database server is a large multiple of what you can build an alerting machine for. I have two systems that are approaching the trash heap just at my house, relative to the main work I do, but that are fully capable of running an alerting system. Building a production quality database server requires a more significant investment: high quality disks, ECC RAM, battery-backed RAID controller, etc. Relative to what the hardware in a database server costs, what you need to build an alerting system is almost free. Oh: and most businesses that are complicated enough to need a serious database server already have them, so they actually cost nothing beyond the software setup time to point them toward the databases, too. > Why does one ever want the guarantee that sync replication gives to only > hold true up to one failure, if a better guarantee doesn't cost anything > extra? (Note that a "good alerting system" is impossible to achieve with > only two servers. You need a third device anyway). > I do not disagree with your theory or reasoning. But as a practical matter, I'm afraid the true cost of the better guarantee you're suggesting here is additional code complexity that will likely cause this feature to miss 9.1 altogether. As far as I'm concerned, this whole diversion into the topic of quorum commit is only consuming resources away from targeting something achievable in the time frame of a single release. > Sync replication between really just two servers is asking for trouble > and certainly not worth the savings in hardware cost. Better invest in a > good UPS and redundant power supplies for a single server. 
I wish I could give you the long list of data recovery projects I've
worked on over the last few years, so you could really appreciate how
much what you're saying here is exactly the opposite of reality. You
cannot make a single server reliable enough to survive all of the
things that Murphy's Law will inflict upon it, at any price. For most
of the businesses I work with who want sync rep, data is not considered
safe until the second copy is on storage miles away from the original,
because they know this too.

Personal anecdote I can share: I used to have an important project
related to stock trading where I kept my backup system about 50 miles
away from me. I was aiming for constant availability, while still being
able to drive to the other server if needed for disaster recovery.
Guess what? Even those two turned out not to be nearly independent
enough; see http://en.wikipedia.org/wiki/Northeast_Blackout_of_2003 for
details of how I lost both of those at the same time for days. Silly
me, I'd only spread them across two adjacent states with different
power providers! Not nearly good enough to avoid a correlated failure.

--
Greg Smith, 2ndQuadrant US
greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
On 10/7/10 6:41 AM, Aidan Van Dyk wrote:
> I'm really confused with all these k < N scenarios I see bandied
> about, because all it really amounts to is "I only want *one*
> synchronous replication, and a bunch of asynchronous replications".
> And a bit of chance thrown in the mix to hope the "synchronous" one
> is pretty stable, and the asynchronous ones aren't *too* far behind
> (define "too" and "far" at your leisure).

Effectively, yes. The difference between k of N synch rep and 1 synch
standby + several async standbys is that in k of N, you have a pool and
aren't dependent on having a specific standby be very reliable, just
that any one of them is.

So if you have k = 3 and N = 10, then you can have 10 standbys and only
3 of them need to ack any specific commit for the master to proceed. As
long as (a) you retain at least one of the 3 which ack'd, and (b) you
have some way of determining which standby is the most "caught up", data
loss is fairly unlikely; you'd need to lose 4 of the 10, and the wrong
4, to lose data.

The advantage of this for availability over just having k = N = 3 comes
when one of the standbys is responding slowly (due to traffic) or goes
offline unexpectedly due to a hardware failure. In the k = N = 3 case,
the system halts. In the k = 3, N = 10 case, you can lose up to 7
standbys without the system going down.

It's notable that the massively scalable transactional databases
(Dynamo, Cassandra, various telecom databases, etc.) all operate this
way. However, I do consider this "advanced" functionality and not worth
pursuing until we have the k = 1 case implemented and well-tested. For
comparison, Cassandra, Hypertable and Riak have been working on their
k < N functionality for a couple of years now and none of them has it
stable *and* fast.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
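[Editor's note: the k-of-N commit wait described above can be sketched as follows. This is an illustrative model with made-up latencies, not PostgreSQL code; `commit_wait` is a hypothetical helper.]

```python
import heapq
import random

def commit_wait(ack_latencies, k):
    """Time the master is blocked on a commit: the k-th fastest standby ack.

    ack_latencies: one ack latency (seconds) per standby, len == N.
    A standby that never acks is modeled as float("inf").
    """
    return heapq.nsmallest(k, ack_latencies)[-1]

# 10 standbys: eight healthy, one overloaded (slow), one dead (never acks).
random.seed(42)
latencies = [random.uniform(0.001, 0.005) for _ in range(8)]
latencies += [2.0, float("inf")]

# With k = 3, N = 10 the master proceeds after the 3rd-fastest ack,
# despite the slow and dead standbys.
print(commit_wait(latencies, 3))    # a few milliseconds

# With k = N = 10 the dead standby blocks every commit forever.
print(commit_wait(latencies, 10))   # inf -> the system halts
```

This is the availability argument in miniature: k controls how many failures the commit path can ride out, while N - k standbys are free to lag.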
On Thu, Oct 7, 2010 at 1:22 PM, Josh Berkus <josh@agliodbs.com> wrote:
> So if you have k = 3 and N = 10, then you can have 10 standbys and only
> 3 of them need to ack any specific commit for the master to proceed. As
> long as (a) you retain at least one of the 3 which ack'd, and (b) you
> have some way of determining which standby is the most "caught up", data
> loss is fairly unlikely; you'd need to lose 4 of the 10, and the wrong
> 4, to lose data.
>
> The advantage of this for availability over just having k = N = 3 comes
> when one of the standbys is responding slowly (due to traffic) or goes
> offline unexpectedly due to a hardware failure. In the k = N = 3 case,
> the system halts. In the k = 3, N = 10 case, you can lose up to 7
> standbys without the system going down.

Sure, but here is where I might not be following. If you want
"synchronous replication" because you want "query availability" while
making sure you're not getting "stale" queries from all your slaves,
then using your k < N (k = 3 and N = 10) situation is screwing
yourself. To get "non-stale" responses, you can only query those k = 3
servers. But you've shot yourself in the foot because you don't know
which 3/10 those will be. The other 7 *are* stale (by definition).
They talk about picking the "caught up" slave when the master fails,
but you actually need to do that for *every query*.

If you say they are "pretty close, so by the time you get the query to
them they will be caught up", well then, all you really want is good
async replication; you don't really *need* the synchronous part.

The only case I see a "race to quorum" type of k < N being useful is
if you're just trying to duplicate data everywhere, but not actually
querying any of the replicas.
I can see that "all queries go to the master, but the chances are
pretty high that multiple machines are going to fail, so I want
multiple replicas" being useful, but I *don't* think that's what most
people are wanting in their "I want 3 of 10 servers to ack the commit".

The difference between good async and sync is only the *guarantee*. If
you don't need the guarantee, you don't need the synchronous part.

a.

--
Aidan Van Dyk                    Create like a god,
aidan@highrise.ca                command like a king,
http://www.highrise.ca/          work like a slave.
> If you want "synchronous replication" because you want "query
> availability" while making sure you're not getting "stale" queries
> from all your slaves, then using your k < N (k = 3 and N = 10)
> situation is screwing yourself.

Correct. If that is your reason for synch standby, then you should be
using a k = N configuration.

However, some people are willing to sacrifice consistency for
durability and availability. We should give them that option
(eventually), since among that triad you can never have more than two.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On 10/07/2010 06:41 PM, Greg Smith wrote:
> The cost of hardware capable of running a database server is a large
> multiple of what you can build an alerting machine for.

You realize you don't need lots of disks or RAM for a box that only
ACKs? A box with two SAS disks and a BBU isn't that expensive anymore.

> I do not disagree with your theory or reasoning. But as a practical
> matter, I'm afraid the true cost of the better guarantee you're
> suggesting here is additional code complexity that will likely cause
> this feature to miss 9.1 altogether. As far as I'm concerned, this
> whole diversion into the topic of quorum commit is only consuming
> resources away from targeting something achievable in the time frame of
> a single release.

So far I've been under the impression that Simon already has the code
for quorum_commit k = 1. What I'm opposed to is the timeout "feature",
which I consider to be additional code, unneeded complexity and a
foot-gun.

> You cannot make a single server reliable enough to survive all of
> the things that Murphy's Law will inflict upon it, at any price.

That's exactly what I'm saying applies to two servers as well. And
that's why a timeout is a bad thing here: the chance that the second
node fails as well is there (and is higher than you think, according to
Murphy).

> For
> most of the businesses I work with who want sync rep, data is not
> considered safe until the second copy is on storage miles away from the
> original, because they know this too.

Now, those are the people who really need sync rep, yes. How happy do
you think those businesses would be to find out that Postgres is
cheating on them in case of a network outage, for example? Do they
really value (write!) availability more than data safety?

> Silly
> me, I'd only spread them across two adjacent states with different power
> providers! Not nearly good enough to avoid a correlated failure.

Thanks for sharing this. I hope you didn't lose data.

Regards

Markus Wanner
> But as a practical matter, I'm afraid the true cost of the better > guarantee you're suggesting here is additional code complexity that will > likely cause this feature to miss 9.1 altogether. As far as I'm > concerned, this whole diversion into the topic of quorum commit is only > consuming resources away from targeting something achievable in the time > frame of a single release. Yes. My purpose in starting this thread was to show that k > 1 "quorum commit" is considerably more complex than the people who have been bringing it up in other threads seem to think it is. It is not achievable for 9.1, and maybe not even for 9.2. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Aidan Van Dyk <aidan@highrise.ca> wrote: > To get "non-stale" responses, you can only query those k=3 > servers. But you've shot your self in the foot because you don't > know which 3/10 those will be. The other 7 *are* stale (by > definition). They talk about picking the "caught up" slave when > the master fails, but you actually need to do that for *every > query*. With web applications, at least, you often don't care that the data read is absolutely up-to-date, as long as the point in time doesn't jump around from one request to the next. When we have used load balancing between multiple database servers (which has actually become unnecessary for us lately because PostgreSQL has gotten so darned fast!), we have established affinity between a session and one of the database servers, so that if they became slightly out of sync, data would not pop in and out of existence arbitrarily. I think a reasonable person could combine this technique with a "3 of 10" synchronous replication quorum to get both safe persistence of data and reasonable performance. I can also envision use cases where this would not be desirable. -Kevin
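[Editor's note: the session-affinity technique Kevin describes can be sketched roughly like this. It uses a simple modulo mapping for illustration; a real load balancer would more likely use consistent hashing so that removing one standby doesn't reshuffle every session. All names here are hypothetical.]

```python
import hashlib

def standby_for_session(session_id, standbys):
    # Deterministically pin each session to one standby, so the session's
    # point-in-time view never jumps around between differently-lagged
    # servers, even though any individual standby may be slightly stale.
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return standbys[h % len(standbys)]

standbys = ["standby1", "standby2", "standby3"]

# The same session always lands on the same standby:
first = standby_for_session("user-42-session", standbys)
assert first == standby_for_session("user-42-session", standbys)
assert first in standbys
```

Combined with a k-of-N quorum on the master, this gives each read-only session a monotonically advancing view without requiring every standby to be fully caught up.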
On Thu, Oct 7, 2010 at 2:10 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Aidan Van Dyk <aidan@highrise.ca> wrote: > >> To get "non-stale" responses, you can only query those k=3 >> servers. But you've shot your self in the foot because you don't >> know which 3/10 those will be. The other 7 *are* stale (by >> definition). They talk about picking the "caught up" slave when >> the master fails, but you actually need to do that for *every >> query*. > > With web applications, at least, you often don't care that the data > read is absolutely up-to-date, as long as the point in time doesn't > jump around from one request to the next. When we have used load > balancing between multiple database servers (which has actually > become unnecessary for us lately because PostgreSQL has gotten so > darned fast!), we have established affinity between a session and > one of the database servers, so that if they became slightly out of > sync, data would not pop in and out of existence arbitrarily. I > think a reasonable person could combine this technique with a "3 of > 10" synchronous replication quorum to get both safe persistence of > data and reasonable performance. > > I can also envision use cases where this would not be desirable. Well, keep in mind all updates have to be done on the single master. That works pretty well for fine-grained replication, but I don't think it's very good for full-cluster replication. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> wrote: > Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> With web applications, at least, you often don't care that the >> data read is absolutely up-to-date, as long as the point in time >> doesn't jump around from one request to the next. When we have >> used load balancing between multiple database servers (which has >> actually become unnecessary for us lately because PostgreSQL has >> gotten so darned fast!), we have established affinity between a >> session and one of the database servers, so that if they became >> slightly out of sync, data would not pop in and out of existence >> arbitrarily. I think a reasonable person could combine this >> technique with a "3 of 10" synchronous replication quorum to get >> both safe persistence of data and reasonable performance. >> >> I can also envision use cases where this would not be desirable. > > Well, keep in mind all updates have to be done on the single > master. That works pretty well for fine-grained replication, but > I don't think it's very good for full-cluster replication. I'm completely failing to understand your point here. Could you restate another way? -Kevin
On 10/07/2010 03:19 PM, Dimitri Fontaine wrote:
> I think you're all into durability, and that's good. The extra cost is
> service downtime

It's just *reduced* availability. That doesn't necessarily mean
downtime, if you combine cleverly with async replication.

> if that's not what you're after: there's also
> availability and load balancing read queries on a system with no lag (no
> stale data servicing) when all is working right.

All I'm saying is that those use cases are much better served with
async replication, maybe together with something that warns and takes
action in case the standby's lag gets too big. What kind of customers
do you think really need a no-lag solution for read-only queries? In
the LAN case, the lag of async rep is negligible, and in the WAN case
the latencies of sync rep are prohibitive.

> My proposal is to make the risk window obvious and the
> behavior when you enter it configurable.

I don't buy that. The risk calculation gets a lot simpler and more
obvious with strict guarantees.

Regards

Markus Wanner
On 10/07/2010 07:44 PM, Aidan Van Dyk wrote: > The only case I see a "race to quorum" type of k < N being useful is > if you're just trying to duplicate data everywhere, but not actually > querying any of the replicas. I can see that "all queries go to the > master, but the chances are pretty high the multiple machines are > going to fail so I want >> multiple replicas" being useful, but I > *don't* think that's what most people are wanting in their "I want 3 > of 10 servers to ack the commit". What else do you think they want it for, if not for protection against data loss? (Note that the queries don't need to go to the master exclusively if you can live with some lag - and I think the vast majority of people can. The zero data loss guarantee holds true in any case, though). > The difference between good async and sync is only the *guarentee*. > If you don't need the guarantee, you don't need the synchronous part. Here we are exactly on the same page again. Regards Markus Wanner
On Thu, Oct 7, 2010 at 2:31 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: >> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > >>> With web applications, at least, you often don't care that the >>> data read is absolutely up-to-date, as long as the point in time >>> doesn't jump around from one request to the next. When we have >>> used load balancing between multiple database servers (which has >>> actually become unnecessary for us lately because PostgreSQL has >>> gotten so darned fast!), we have established affinity between a >>> session and one of the database servers, so that if they became >>> slightly out of sync, data would not pop in and out of existence >>> arbitrarily. I think a reasonable person could combine this >>> technique with a "3 of 10" synchronous replication quorum to get >>> both safe persistence of data and reasonable performance. >>> >>> I can also envision use cases where this would not be desirable. >> >> Well, keep in mind all updates have to be done on the single >> master. That works pretty well for fine-grained replication, but >> I don't think it's very good for full-cluster replication. > > I'm completely failing to understand your point here. Could you > restate another way? Establishing an affinity between a session and one of the database servers will only help if the traffic is strictly read-only. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Markus Wanner <markus@bluegap.ch> writes:
> I don't buy that. The risk calculation gets a lot simpler and more
> obvious with strict guarantees.

Ok, I'm lost in the use cases and analysis. I still don't understand
why you want to consider the system already synchronous when it's not,
whatever the guarantee you're asking for.

All I'm saying is that we should be able to know and show what the
current system is up to, and we should be able to offer sane reactions
in case of errors. You call it a sane reaction to block the master
entirely when the standby ain't ready yet (it's still at the base
backup state), and I can live with that. As an option.

I say that either we go the lax quorum route, or we have to care for
the details and summarize the failure cases with precision, and the
possible responses with care. I don't see that being possible without
a clear state for each element in the system, the transitions between
those states, and a way to derive the global state of the distributed
system out of that.

It might be that the simpler way to go here is what Greg Smith has been
proposing for a long time already, and again quite recently on this
thread: have all the information you need in a system table and offer
to run a user-defined function to determine the state of the system.

I think we managed to show what Josh Berkus wanted to know: that this
is a quagmire. Now, the problem I have is not quorum commit but the
very definition of synchronous replication and the system we're trying
to build. I'm not sure there are two of us wanting the same thing here.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Robert Haas <robertmhaas@gmail.com> wrote: > Establishing an affinity between a session and one of the database > servers will only help if the traffic is strictly read-only. Thanks; I now see your point. In our environment, that's pretty common. Our most heavily used web app (the one for which we have, at times, needed load balancing) connects to the database with a read-only login. Many of our web apps do their writing by posting to queues which are handled at the appropriate source database later. (I had the opportunity to use one of these "for real" last night, to fill in a juror questionnaire after receiving a summons from the jury clerk in the county where I live.) Like I said, there are sane cases for this usage, but it won't fit everybody. I have no idea on percentages. -Kevin
On Thu, 2010-10-07 at 13:44 -0400, Aidan Van Dyk wrote: > To get "non-stale" responses, you can only query those k=3 servers. > But you've shot your self in the foot because you don't know which > 3/10 those will be. The other 7 *are* stale (by definition). They > talk about picking the "caught up" slave when the master fails, but > you actually need to do that for *every query*. There is a big confusion around that point and I need to point out that statement isn't accurate. It's taken me a long while to understand this. Asking for k > 1 does *not* mean those servers are time synchronised. All it means is that the master will stop waiting after 3 acknowledgements. There is no connection between the master receiving acknowledgements and the standby applying changes received from master; the standbys are all independent of one another. In a bad case, those 3 acknowledgements might happen say 5 seconds apart on the worst and best of the 3 servers. So the first standby to receive the data could have applied the changes ~4.8 seconds prior to the 3rd standby. There is still a chance of reading stale data on one standby, but reading fresh data on another server. In most cases the time window is small, but still exists. The other 7 are stale with respect to the first 3. But then so are the last 9 compared with the first one. The value of k has nothing whatsoever to do with the time difference between the master and the last standby to receive/apply the changes. The gap between first and last standby (i.e. N, not k) is the time window during which a query might/might not see a particular committed result. So standbys are eventually consistent whether or not the master relies on them to provide an acknowledgement. The only place where you can guarantee non-stale data is on the master. High values of k reduce the possibility of data loss, whereas expected cluster availability is reduced as N - k gets smaller. 
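[Editor's note: Simon's point that k bounds the master's wait while N governs the staleness window can be illustrated with hypothetical numbers, using the ~4.8-second figure from his example.]

```python
# Hypothetical times (seconds after commit on the master) at which each
# of N = 10 standbys applies a given transaction.
apply_times = sorted([0.2, 0.4, 0.9, 1.1, 1.5, 2.0, 2.6, 3.3, 4.1, 5.0])
k = 3

# The master resumes as soon as the k-th acknowledgement arrives...
master_waits = apply_times[k - 1]                    # 0.9 s

# ...but the window during which standbys disagree is set by the whole
# pool (first vs. last standby to apply), independent of k.
staleness_window = apply_times[-1] - apply_times[0]  # ~4.8 s

print(master_waits, staleness_window)
```

Raising k from 3 to 10 would make the master wait the full 5.0 seconds, yet a query racing the last apply could still see stale data on some standby; only the master guarantees non-stale reads.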
--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-10-07 at 19:50 +0200, Markus Wanner wrote:
> So far I've been under the impression that Simon already has the code
> for quorum_commit k = 1.

I do, but it's not a parameter. The k = 1 behaviour is hardcoded and
considerably simplifies the design. Moving to k > 1 is additional
work, slows things down and seems likely to be fragile.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
All, > Establishing an affinity between a session and one of the database > servers will only help if the traffic is strictly read-only. I think this thread has drifted very far away from anything we're going to do for 9.1. And seems to have little to do with synchronous replication. Synch rep ensures durability. It is not, by itself, a method of ensuring consistency, nor does it pretend to be one. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Markus Wanner wrote:
> So far I've been under the impression that Simon already has the code
> for quorum_commit k = 1.
>
> What I'm opposed to is the timeout "feature", which I consider to be
> additional code, unneeded complexity and a foot-gun.

Additional code? Yes. Foot-gun? Yes. Timeout should be disabled by
default so that you get wait-forever unless you ask for something
different? Probably. Unneeded? This is where we don't agree anymore.

The example that Josh Berkus just sent to the list is a typical example
of what I expect people to do here. They'll use sync rep to maximize
the odds that a system failure doesn't cause any transaction loss.
They'll use good quality hardware on the master so it's unlikely to
fail. But when the database finds the standby unreachable, and it's
left with the choice between either degrading into async rep or coming
to a complete halt, you must give people the option of choosing to
degrade instead after a timeout. Let them set off the red flashing
lights, sound the alarms, and pray the master doesn't go down until
you can fix the problem. Allowing uptime concerns to win over the
normal sync rep guarantees is a completely valid business decision,
one that people will absolutely want to make even though it runs
opposite to your personal preference here.

I don't see this as needing any implementation more complicated than
the usual way such timeouts are handled. Note how long you've been
trying to reach the standby. Default to -1 for forever. And if you
hit the timeout, mark the standby as degraded and force it to do a
proper resync when it reconnects. Once that's done, it can re-enter
sync rep mode again, via the same process a new node would have used.

--
Greg Smith, 2ndQuadrant US
greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
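[Editor's note: the timeout-then-degrade scheme Greg outlines could be sketched as below. This is a toy model, not the patch's actual design; `wait_for_ack`, the standby dict, and the `ack_arrived` callback are all hypothetical.]

```python
import enum
import time

class StandbyState(enum.Enum):
    SYNC = "sync"            # acks count toward synchronous commit
    DEGRADED = "degraded"    # timed out; must resync before rejoining

def wait_for_ack(standby, ack_arrived, timeout=-1.0, poll=0.005):
    """Wait for a standby's ack. timeout = -1 means wait forever (the
    proposed default). On timeout, mark the standby DEGRADED: its acks
    stop counting until it has done a proper resync, exactly as a newly
    joining node would."""
    start = time.monotonic()
    while not ack_arrived():
        if timeout >= 0 and time.monotonic() - start > timeout:
            standby["state"] = StandbyState.DEGRADED
            return False   # commit proceeds; alarms should fire here
        time.sleep(poll)
    return True

# A standby that never acks gets degraded after the timeout elapses:
s = {"state": StandbyState.SYNC}
ok = wait_for_ack(s, lambda: False, timeout=0.05)
assert not ok and s["state"] is StandbyState.DEGRADED
```

The key property is that degradation is explicit and sticky: once a standby misses the timeout, nothing can "sensibly" mistake it for a synchronous standby until it resyncs.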
On Wed, Oct 6, 2010 at 6:11 PM, Markus Wanner <markus@bluegap.ch> wrote: > Yeah, sounds more likely. Then I'm surprised that I didn't find any > warning that the Protocol C definitely reduces availability (with the > ko-count=0 default, that is). Really? I don't think that ko-count=0 means "wait-forever". IIRC, when I tried DRBD, I can write data in master's DRBD disk, without connected standby. So I think that by default the master waits for timeout and works alone when the standby goes down. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Oct 6, 2010 at 6:00 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> In general, salvaging the WAL that was not sent to the standby yet is
> outright impossible. You can't achieve zero data loss with asynchronous
> replication at all.

No. That depends on the type of failure. Unless the disk in the master
has been corrupted, we might be able to salvage WAL.

>> If we want only no data loss, we have only to implement the wait-forever
>> option. But if we make consideration for the above-mentioned availability,
>> the return-immediately option also would be required.
>>
>> In some (many, I think) cases, I think that we need to consider
>> availability and no data loss together, and consider the balance of them.
>
> If you need both, you need three servers as Simon pointed out earlier. There
> is no way around that.

No. That depends on how far you'd like to go to ensure no data loss.
People who use a shared-disk failover solution with one master and one
standby don't need such high durability. They can avoid data loss by
using something like RAID, to a certain extent. So it's not a problem
for them to run the master alone after failover happens or the standby
goes down. But something like RAID cannot increase availability.
Synchronous replication is the solution for that purpose.

Of course, if we are worried about running the master alone, we can
increase the number of standbys. Furthermore, if we'd like to avoid
data loss from a disaster which can destroy all the servers at the
same time, we might need to increase the standbys further and locate
some of them at a remote site.

So a return-immediately option (i.e., a small timeout) is useful for
some use cases.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Oct 7, 2010 at 10:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Oct 6, 2010 at 6:00 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> In general, salvaging the WAL that was not sent to the standby yet is >> outright impossible. You can't achieve zero data loss with asynchronous >> replication at all. > > No. That depends on the type of failure. Unless the disk in the master has > been corrupted, we might be able to salvage WAL. So I guess another way to say this is that zero data loss is unachievable, period. Greg Smith made a flip comment about having been so silly as to only put his redundant servers in adjacent states on different power grids, and yet still having an outage due to the Northeast blackouts. So what would he have had to do to completely rule out a correlated failure? Answer: It can't be done. If a massive asteroid comes zooming into the inner solar system tomorrow and hits the earth, obliterating all life, you're toast. Or likewise if nuclear war ensues. You could put your redundant server on the moon or, better yet, on a moon of one of the outer planets, but the hosting costs are pretty high and the ping times suck. So the point is that the question is not whether or not a correlated failure can happen, but whether you can imagine a scenario where a correlated failure has occurred yet you still wish you had your data. Different people will, obviously, draw that line in different places. Let's start by doing something simple that covers SOME of the cases people want, get it committed, and then move on from there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, Oct 6, 2010 at 9:22 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> From my experience operating londiste, those states would be:
>
> 1. base-backup — self-explanatory
> 2. catch-up — getting the WAL to catch up after base backup
> 3. wanna-sync — don't yet have all the WAL to get in sync
> 4. do-sync — all WALs are there, coming soon
> 5. ok (async | recv | fsync | reply — feedback loop engaged)

I agree to managing these standby states, from a different standpoint.
To avoid data loss, we must not promote a standby that is still partway
through catching up with the master. If clusterware can get the
current standby state via SQL, it can check whether failover would
cause data loss and give up the failover before creating the trigger
file.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
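[Editor's note: Dimitri's five states plus the promotion check Fujii describes could be sketched as follows. This is hypothetical clusterware-side code, not anything in the proposed patch.]

```python
import enum

class StandbyState(enum.Enum):
    BASE_BACKUP = 1   # taking the initial base backup
    CATCH_UP = 2      # replaying WAL accumulated during the backup
    WANNA_SYNC = 3    # still missing some WAL needed to get in sync
    DO_SYNC = 4       # all WAL present, about to report in sync
    OK = 5            # feedback loop engaged (async/recv/fsync/reply)

def safe_to_promote(state):
    # Fujii's check: refuse to create the trigger file for a standby that
    # is still catching up, since promoting it would silently lose the
    # transactions it never received.
    return state is StandbyState.OK

assert not safe_to_promote(StandbyState.CATCH_UP)
assert safe_to_promote(StandbyState.OK)
```

Exposing the current state via SQL would let clusterware run exactly this kind of check before triggering failover.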
On Thu, Oct 7, 2010 at 5:01 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > You seem willing to trade anything for that guarantee. I seek a more > pragmatic approach that balances availability and risk. > > Those views are different, but not inconsistent. Oracle manages to offer > multiple options and so can we. +1 Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Oct 7, 2010 at 3:01 AM, Markus Wanner <markus@bluegap.ch> wrote: > Of course, it doesn't make sense to wait-forever on *every* standby that > ever gets added. Quorum commit is required, yes (and that's what this > thread is about, IIRC). But with quorum commit, adding a standby only > improves availability, but certainly doesn't block the master in any > way. But, even with quorum commit, if you choose the wait-forever option, failover would decrease availability. Right after the failover, no standby has connected to the new master, so if quorum >= 1, all the transactions must wait for a while. Basically we need to take a base backup from the new master to start the standbys and make them connect to the new master. This might take a long time. Since transaction commits cannot advance for that time, availability would go down. Or do you think that the wait-forever option is applied only when the standby goes down? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2010-10-07 at 19:44 -0400, Greg Smith wrote: > I don't see this as needing any implementation any more complicated than > the usual way such timeouts are handled. Note how long you've been > trying to reach the standby. Default to -1 for forever. And if you hit > the timeout, mark the standby as degraded and force them to do a proper > resync when they disconnect. Once that's done, then they can re-enter > sync rep mode again, via the same process a new node would have done so. What I don't understand is why this isn't obvious to everyone. Greg this is very well put and the -hackers need to start thinking like people that actually use the database. JD -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Fri, Oct 8, 2010 at 8:44 AM, Greg Smith <greg@2ndquadrant.com> wrote: > Additional code? Yes. Foot-gun? Yes. Timeout should be disabled by > default so that you get wait forever unless you ask for something different? > Probably. Unneeded? This is where we don't agree anymore. The example > that Josh Berkus just sent to the list is a typical example of what I expect > people to do here. They'll use Sync Rep to maximize the odds a system > failure doesn't cause any transaction loss. They'll use good quality > hardware on the master so it's unlikely to fail. But when the database > finds the standby unreachable, and it's left with the choice between either > degrading into async rep or coming to a complete halt, you must give people > the option of choosing to degrade instead after a timeout. Let them set off > the red flashing lights, sound the alarms, and pray the master doesn't go > down until you can fix the problem. But the choice to allow uptime concerns > to win over the normal sync rep preferences, that's a completely valid > business decision people will absolutely want to make in a way opposite of > your personal preference here. Definitely agreed. > I don't see this as needing any implementation any more complicated than the > usual way such timeouts are handled. Note how long you've been trying to > reach the standby. Default to -1 for forever. And if you hit the timeout, > mark the standby as degraded and force them to do a proper resync when they > disconnect. Once that's done, then they can re-enter sync rep mode again, > via the same process a new node would have done so. Fair enough. One question is when this timeout is applied. Obviously it should be applied when the standby goes down. But should the timeout also be applied when we initially start the master, and when no standby has connected to the new master yet after a failover? I guess that people who want wait-forever would want to use "timeout = -1" for all those cases. 
Otherwise they cannot ensure zero data loss. OTOH, people who don't want wait-forever would not want to wait for the timeout in the latter two cases. So ISTM that something like an enable_wait_forever or reaction_after_timeout parameter is required separately from the timeout. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
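Greg's timeout proposal, with -1 meaning wait forever, can be sketched in a few lines. This is a hypothetical model of the control flow only (the real implementation would live in the master's WAL-sender/commit path, not in a polling loop):

```python
import time

WAIT_FOREVER = -1  # proposed default: never degrade the standby

def wait_for_ack(ack_received, timeout_s, poll_s=0.01, clock=time.monotonic):
    """Wait for the standby's acknowledgement of a commit.

    Returns True if the ack arrived in time.  Returns False if the timeout
    expired, in which case the master would mark the standby as degraded and
    force a proper resync before it may re-enter sync rep.  A timeout_s of
    WAIT_FOREVER never gives up, matching the wait-forever behaviour.
    """
    start = clock()
    while not ack_received():
        if timeout_s != WAIT_FOREVER and clock() - start >= timeout_s:
            return False  # degrade to async; set off the red flashing lights
        time.sleep(poll_s)
    return True
```

Fujii's point then becomes: which call sites use this function — only the standby-failure path, or also master startup and post-failover waiting?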
Greg Smith <greg@2ndquadrant.com> writes: […] > I don't see this as needing any implementation any more complicated than the > usual way such timeouts are handled. Note how long you've been trying to > reach the standby. Default to -1 for forever. And if you hit the timeout, > mark the standby as degraded and force them to do a proper resync when they > disconnect. Once that's done, then they can re-enter sync rep mode again, > via the same process a new node would have done so. Thank you for this post, which is so much better than anything I could achieve. Just wanted to add that it should be possible in lots of cases to have a standby rejoin the party without getting as far back as taking a new base backup. Depends on wal_keep_segments and standby's degraded state, among other parameters (archives, etc). Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 10/08/2010 12:30 AM, Simon Riggs wrote: > I do, but its not a parameter. The k = 1 behaviour is hardcoded and > considerably simplifies the design. Moving to k > 1 is additional work, > slows things down and seems likely to be fragile. Perfect! So I'm all in favor of committing that, but leaving away the timeout thing, which I think is just adding unneeded complexity and fragility. Regards Markus Wanner
Simon, On 10/08/2010 12:25 AM, Simon Riggs wrote: > Asking for k > 1 does *not* mean those servers are time synchronised. Yes, it's technically impossible to create a fully synchronized cluster (on the basis of shared-nothing nodes we are aiming for, that is). There always is some kind of "lag" on either side. Maybe the use case for a no-lag cluster doesn't exist, because it's technically not feasible. > In a bad case, those 3 acknowledgements might happen say 5 seconds apart > on the worst and best of the 3 servers. So the first standby to receive > the data could have applied the changes ~4.8 seconds prior to the 3rd > standby. There is still a chance of reading stale data on one standby, > but reading fresh data on another server. In most cases the time window > is small, but still exists. Well, the transaction isn't committed on the master, so one could argue it shouldn't matter. The guarantee just needs to be one way: as soon as confirmed committed to the client, all k standbys need to have it committed, too. (At least for the "apply" replication level). > So standbys are eventually consistent whether or not the master relies > on them to provide an acknowledgement. The only place where you can > guarantee non-stale data is on the master. That's formulated a bit too strongly. With the "apply" replication level, you should be able to rely on the guarantee that a committed transaction is visible on at least k standbys. Maybe in advance of the commit on the master, but I wouldn't call that "stale" data. Given the current proposals, the master is the one that's "lagging" the most, compared to the k standbys. > High values of k reduce the possibility of data loss, whereas expected > cluster availability is reduced as N - k gets smaller. Exactly. One addendum: a timeout increases availability at the cost of increased danger of data loss and higher complexity. Don't use it, just increase (N - k) instead. Regards Markus Wanner
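Simon's "bad case" is worth making concrete: the master returns from commit at the k-th earliest acknowledgement, but the window of possible stale reads is set by all N standbys. A small illustrative model (the timings are invented, matching Simon's example):

```python
def commit_wait(ack_times, k):
    """Seconds until the master's commit returns: the k-th earliest ack.

    ack_times maps standby name -> seconds until that standby has applied
    the commit.  The master only waits for the first k acknowledgements;
    the remaining standbys apply later, so the window during which a query
    may or may not see the committed row spans all N nodes, not just k.
    """
    return sorted(ack_times.values())[k - 1]

# Simon's bad case: five standbys whose acks arrive seconds apart.
acks = {"s1": 0.2, "s2": 1.0, "s3": 5.0, "s4": 7.0, "s5": 9.0}
master_returns = commit_wait(acks, k=3)  # commit returns once s3 acks
visibility_window = max(acks.values()) - min(acks.values())  # set by N, not k
```

Here s1 applied the change ~4.8 seconds before s3 acked, and the full window of inconsistent reads across the cluster is 8.8 seconds, regardless of k.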
On 07.10.2010 21:38, Markus Wanner wrote: > On 10/07/2010 03:19 PM, Dimitri Fontaine wrote: >> I think you're all into durability, and that's good. The extra cost is >> service downtime > > It's just *reduced* availability. That doesn't necessarily mean > downtime, if you combine cleverly with async replication. > >> if that's not what you're after: there's also >> availability and load balancing read queries on a system with no lag (no >> stale data servicing) when all is working right. > > All I'm saying is that those use cases are much better served with async > replication. Maybe together with something that warns and takes action > in case the standby's lag gets too big. > > Or what kind of customers do you think really need a no-lag solution for > read-only queries? In the LAN case, the lag of async rep is negligible > and in the WAN case the latencies of sync rep are prohibitive. There is a very good use case for that particular set up, actually. If your hot standby is guaranteed to be up-to-date with any transaction that has been committed in the master, you can use the standby interchangeably with the master for read-only queries. Very useful for load balancing. Imagine a web application that's mostly read-only, but a user can modify his own personal details like name and address, for example. Imagine that the user changes his street address and clicks 'save', causing an UPDATE, and the next query fetches that information again to display to the user. If you use load balancing, the query can be routed to the hot standby server, and if it lags even 1-2 seconds behind it's quite possible that it will still return the old address. The user will go "WTF, I just changed that!". That's the "load balancing" use case, which is quite different from the "zero data loss on server failure" use case that most people here seem to be interested in. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 10/08/2010 04:01 AM, Fujii Masao wrote: > Really? I don't think that ko-count=0 means "wait-forever". Telling from the documentation, I'd also say it doesn't wait forever by default. However, please note that there are different parameters for the initial wait for connection during boot up (wfc-timeout and degr-wfc-timeout). So you might want to test what happens on a node failure, not just the absence of a standby. Regards Markus Wanner
On 08.10.2010 06:41, Fujii Masao wrote: > On Thu, Oct 7, 2010 at 3:01 AM, Markus Wanner<markus@bluegap.ch> wrote: >> Of course, it doesn't make sense to wait-forever on *every* standby that >> ever gets added. Quorum commit is required, yes (and that's what this >> thread is about, IIRC). But with quorum commit, adding a standby only >> improves availability, but certainly doesn't block the master in any >> way. > > But, even with quorum commit, if you choose wait-forever option, > failover would decrease availability. Right after the failover, > no standby has connected to new master, so if quorum>= 1, all > the transactions must wait for a while. Sure, the new master can't proceed with commits until enough standbys have connected to it. > Basically we need to take a base backup from new master to start > the standbys and make them connect to new master. Do we really need that? I don't think that's acceptable, we'll need to fix that if that's the case. I think you're right, streaming replication doesn't work across timeline changes. We left that out of 9.0, to keep things simple, but it seems that we really should fix that for 9.1. You can cross timelines with the archive, though. But IIRC there was some issue with that too, you needed to restart the standbys because the standby scans what timelines exist at the beginning of recovery, and won't notice new timelines that appear after that? We need to address that, apart from any of the other things discussed wrt. synchronous replication. It will benefit asynchronous replication too. IMHO *that* is the next thing we should do, the next patch we commit. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2010-10-08 at 09:52 +0200, Markus Wanner wrote: > One addendum: a timeout increases availability at the cost of > increased danger of data loss and higher complexity. Don't use it, > just increase (N - k) instead. Completely agree. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 10/08/2010 05:41 AM, Fujii Masao wrote: > But, even with quorum commit, if you choose wait-forever option, > failover would decrease availability. Right after the failover, > no standby has connected to new master, so if quorum >= 1, all > the transactions must wait for a while. That's a point, yes. But again, this is just write-availability, you can happily read from all active standbys. And connection time is certainly negligible compared to any kind of timeout (which certainly needs to be way bigger than a couple of network round-trips). > Basically we need to take a base backup from new master to start > the standbys and make them connect to new master. This might take > a long time. Since transaction commits cannot advance for that time, > availability would goes down. Just don't increase your quorum_commit to unreasonable values which your hardware cannot possibly satisfy. It doesn't make sense to set a quorum_commit of 1 or even bigger, if you don't already have a standby attached. Start with 0 (i.e. replication off), then add standbys, then increase quorum_commit to your new requirements. > Or you think that wait-forever option is applied only when the > standby goes down? That wouldn't work in case of a full-cluster crash, where the wait-forever option is required again. Otherwise you risk a split-brain situation. Regards Markus Wanner
On 08.10.2010 01:25, Simon Riggs wrote: > On Thu, 2010-10-07 at 13:44 -0400, Aidan Van Dyk wrote: > >> To get "non-stale" responses, you can only query those k=3 servers. >> But you've shot your self in the foot because you don't know which >> 3/10 those will be. The other 7 *are* stale (by definition). They >> talk about picking the "caught up" slave when the master fails, but >> you actually need to do that for *every query*. > > There is a big confusion around that point and I need to point out that > statement isn't accurate. It's taken me a long while to understand this. > > Asking for k> 1 does *not* mean those servers are time synchronised. > All it means is that the master will stop waiting after 3 > acknowledgements. There is no connection between the master receiving > acknowledgements and the standby applying changes received from master; > the standbys are all independent of one another. > > In a bad case, those 3 acknowledgements might happen say 5 seconds apart > on the worst and best of the 3 servers. So the first standby to receive > the data could have applied the changes ~4.8 seconds prior to the 3rd > standby. There is still a chance of reading stale data on one standby, > but reading fresh data on another server. In most cases the time window > is small, but still exists. > > The other 7 are stale with respect to the first 3. But then so are the > last 9 compared with the first one. The value of k has nothing > whatsoever to do with the time difference between the master and the > last standby to receive/apply the changes. The gap between first and > last standby (i.e. N, not k) is the time window during which a query > might/might not see a particular committed result. > > So standbys are eventually consistent whether or not the master relies > on them to provide an acknowledgement. The only place where you can > guarantee non-stale data is on the master. Yes, that's a good point. 
Synchronous replication for load-balancing purposes guarantees that when *you* perform a commit, after it finishes it will be visible in all standbys. But if you run the same query across different standbys, you're not guaranteed to get the same results. If you just pick a random server for every query, you might even see time moving backwards. Affinity is definitely a good idea for the load-balancing scenario, but even then the anomaly is possible if you get re-routed to a different server because the one you were bound to dies. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
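The affinity idea, and the re-routing anomaly Heikki mentions, can be sketched as a toy read router (entirely hypothetical; real load balancers like pgpool sit outside the server):

```python
import random

class ReadRouter:
    """Bind each session to one standby ('affinity') so time never appears
    to move backwards for that session.  The anomaly survives in one case:
    if the bound node dies, the session is re-routed to a standby that may
    be behind the one it was reading from."""

    def __init__(self, standbys):
        self.standbys = set(standbys)
        self.bound = {}  # session -> standby it sticks to

    def route(self, session):
        node = self.bound.get(session)
        if node not in self.standbys:  # first query, or the bound node died
            node = random.choice(sorted(self.standbys))
            self.bound[session] = node
        return node

    def node_died(self, standby):
        self.standbys.discard(standby)
```

Within one binding the session sees monotonically advancing data; only the forced re-bind can expose an older snapshot.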
On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote: > > > > Or what kind of customers do you think really need a no-lag solution for > > read-only queries? In the LAN case, the lag of async rep is negligible > > and in the WAN case the latencies of sync rep are prohibitive. > > There is a very good use case for that particular set up, actually. If > your hot standby is guaranteed to be up-to-date with any transaction > that has been committed in the master, you can use the standby > interchangeably with the master for read-only queries. This is an important point. It is desirable, but there is no such thing. We must not take any project decisions based upon that false premise. Hot Standby is never guaranteed to be up-to-date with master. There is no such thing as certainty that you have the same data as the master. All sync rep gives you is a better durability guarantee that the changes are safe. It doesn't guarantee those changes are transferred to all nodes prior to making the data changes on any one standby. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 08.10.2010 11:25, Simon Riggs wrote: > On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote: >>> >>> Or what kind of customers do you think really need a no-lag solution for >>> read-only queries? In the LAN case, the lag of async rep is negligible >>> and in the WAN case the latencies of sync rep are prohibitive. >> >> There is a very good use case for that particular set up, actually. If >> your hot standby is guaranteed to be up-to-date with any transaction >> that has been committed in the master, you can use the standby >> interchangeably with the master for read-only queries. > > This is an important point. It is desirable, but there is no such thing. > We must not take any project decisions based upon that false premise. > > Hot Standby is never guaranteed to be up-to-date with master. There is > no such thing as certainty that you have the same data as the master. Synchronous replication in the 'replay' mode is supposed to guarantee exactly that, no? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 10/08/2010 10:27 AM, Heikki Linnakangas wrote: > Synchronous replication in the 'replay' mode is supposed to guarantee > exactly that, no? The master may lag behind, so it's not strictly speaking the same data. Regards Markus Wanner
On 10/08/2010 09:56 AM, Heikki Linnakangas wrote: > Imagine a web application that's mostly read-only, but a > user can modify his own personal details like name and address, for > example. Imagine that the user changes his street address and clicks > 'save', causing an UPDATE, and the next query fetches that information > again to display to the user. I don't think that use case justifies sync replication and the additional network overhead it brings. Latency is low in that case, okay, but so is the lag for async replication. Why not tell the load balancer to read from the master for n seconds after the last write? After that, it should be safe to query the standbys again. If the load on the master is the problem, and you want to reduce that by moving the read-only transactions to the slave, sync replication pretty certainly won't help you, either, because it actually *increases* concurrency (by increasing commit latency). Regards Markus Wanner
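Markus's "read from the master for n seconds after the last write" suggestion is simple enough to sketch. A hypothetical router (names and the 5-second window are invented; the window just needs to comfortably exceed the async replication lag):

```python
import time

class StickToMasterRouter:
    """After a session writes, route its reads to the master for a fixed
    stickiness window, long enough for async replication to catch up.
    Afterwards, reads may go to any standby again."""

    def __init__(self, stickiness_s=5.0, clock=time.monotonic):
        self.stickiness_s = stickiness_s
        self.clock = clock
        self.last_write = {}  # session -> timestamp of its last write

    def note_write(self, session):
        self.last_write[session] = self.clock()

    def route_read(self, session):
        t = self.last_write.get(session)
        if t is not None and self.clock() - t < self.stickiness_s:
            return "master"  # the user's own UPDATE is guaranteed visible
        return "standby"     # async lag has had time to drain
```

This gives Heikki's user read-your-writes behaviour without paying sync rep's commit latency on every transaction.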
On Fri, 2010-10-08 at 11:27 +0300, Heikki Linnakangas wrote: > On 08.10.2010 11:25, Simon Riggs wrote: > > On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote: > >>> > >>> Or what kind of customers do you think really need a no-lag solution for > >>> read-only queries? In the LAN case, the lag of async rep is negligible > >>> and in the WAN case the latencies of sync rep are prohibitive. > >> > >> There is a very good use case for that particular set up, actually. If > >> your hot standby is guaranteed to be up-to-date with any transaction > >> that has been committed in the master, you can use the standby > >> interchangeably with the master for read-only queries. > > > > This is an important point. It is desirable, but there is no such thing. > > We must not take any project decisions based upon that false premise. > > > > Hot Standby is never guaranteed to be up-to-date with master. There is > > no such thing as certainty that you have the same data as the master. > > Synchronous replication in the 'replay' mode is supposed to guarantee > exactly that, no? From the perspective of the person making the change on the master: yes. If they make the change, wait for commit, then check the value on a standby, yes it will be there (or a later version). From the perspective of an observer, randomly selecting a standby for load balancing purposes: No, they are not guaranteed to see the "latest" answer, nor even can they find out whether what they are seeing is the latest answer. What sync rep does guarantee is that if the person making the change is told it succeeded (commit) then that change is safe on at least k other servers. Sync rep is about guarantees of safety, not observability. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 10/08/2010 01:44 AM, Greg Smith wrote: > They'll use Sync Rep to maximize > the odds a system failure doesn't cause any transaction loss. They'll > use good quality hardware on the master so it's unlikely to fail. .."unlikely to fail"? Ehm.. is that you speaking, Greg? ;-) > But > when the database finds the standby unreachable, and it's left with the > choice between either degrading into async rep or coming to a complete > halt, you must give people the option of choosing to degrade instead > after a timeout. Let them set off the red flashing lights, sound the > alarms, and pray the master doesn't go down until you can fix the > problem. Okay, okay, fair enough - if there had been red flashing lights. And alarms. And bells and whistles. But that's what I'm afraid the timeout is removing. > I don't see this as needing any implementation any more complicated than > the usual way such timeouts are handled. Note how long you've been > trying to reach the standby. Default to -1 for forever. And if you hit > the timeout, mark the standby as degraded ..and how do you make sure you are not marking your second standby as degraded just because it's currently lagging? Effectively degrading the utterly needed one, because your first standby has just bitten the dust? And how do you prevent the split brain situation in case the master dies shortly after these events, but fails to come up again immediately? Your list of data recovery projects will get larger and the projects more complicated. Because there's a lot more to it than just the implementation of a timeout. Regards Markus Wanner
On 10/08/2010 11:00 AM, Simon Riggs wrote: > From the perspective of an observer, randomly selecting a standby for > load balancing purposes: No, they are not guaranteed to see the "latest" > answer, nor even can they find out whether what they are seeing is the > latest answer. I completely agree. The application (or at least the load balancer) needs to be aware of that fact. Regards Markus
Markus Wanner <markus@bluegap.ch> writes: > ..and how do you make sure you are not marking your second standby as > degraded just because it's currently lagging? Well, in sync rep, a standby that's not able to stay under the timeout is degraded. Full stop. The presence of the timeout (or its value not being -1) means that the admin has chosen this definition. > Effectively degrading the > utterly needed one, because your first standby has just bitten the > dust? Well, now you have a worst case scenario: first standby is dead and the remaining one was not able to keep up. You have lost all your master's failover replacements. > And how do you prevent the split brain situation in case the master dies > shortly after these events, but fails to come up again immediately? Same old story. Either you're able to try and fix the master so that you don't lose any data and don't even have to check for that, or you take a risk and start from a non synced standby. It's all availability against durability again. What I really want us to be able to provide is the clear facts so that whoever has to take the decision is able to. Meaning, here, that it should be easy to see that neither the standby are in sync at this point. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 10/08/2010 11:41 AM, Dimitri Fontaine wrote: > Same old story. Either you're able to try and fix the master so that you > don't lose any data and don't even have to check for that, or you take a > risk and start from a non synced standby. It's all availability against > durability again. ..and a whole lot of manual work that's prone to error, for something that could easily be automated, at an initial, additional cost of certainly less than 2000 EUR (if any at all, in case you already have three servers). Sorry, I still fail to understand that use case. It reminds me of the customer that wanted to save the cost of the BBU and ran with fsync=off. Until his server went down due to a power outage. But yeah, we provide that option as well, yes. Point taken. Regards Markus Wanner
Markus Wanner <markus@bluegap.ch> writes: > ..and a whole lot of manual work, that's prone to error for something > that could easily be automated So, the master just crashed, first standby is dead and second ain't in sync. What's the easy and automated way out? Sorry, I need a hand here. -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Fri, Oct 8, 2010 at 5:07 PM, Markus Wanner <markus@bluegap.ch> wrote: > On 10/08/2010 04:01 AM, Fujii Masao wrote: >> Really? I don't think that ko-count=0 means "wait-forever". > > Telling from the documentation, I'd also say it doesn't wait forever by > default. However, please note that there are different parameters for > the initial wait for connection during boot up (wfc-timeout and > degr-wfc-timeout). So you might to test what happens on a node failure, > not just absence of a standby. Unfortunately I've already taken down my DRBD environment. As far as I heard from my colleague who is familiar with DRBD, standby node failure doesn't prevent the master from writing data to the DRBD disk by default. If there is a DRBD environment available around me, I'll try the test. And, I'd like to know whether the master waits forever because of standby failure in other solutions such as Oracle Data Guard and MySQL semi-synchronous replication. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Greg Smith <greg@2ndquadrant.com> writes: > I don't see this as needing any implementation any more complicated than > the usual way such timeouts are handled. Note how long you've been > trying to reach the standby. Default to -1 for forever. And if you hit > the timeout, mark the standby as degraded and force them to do a proper > resync when they disconnect. Once that's done, then they can re-enter > sync rep mode again, via the same process a new node would have done so. Well, actually, that's *considerably* more complicated than just a timeout. How are you going to "mark the standby as degraded"? The standby can't keep that information, because it's not even connected when the master makes the decision. ISTM that this requires 1. a unique identifier for each standby (not just role names that multiple standbys might share); 2. state on the master associated with each possible standby -- not just the ones currently connected. Both of those are perhaps possible, but the sense I have of the discussion is that people want to avoid them. Actually, #2 seems rather difficult even if you want it. Presumably you'd like to keep that state in reliable storage, so it survives master crashes. But how you gonna commit a change to that state, if you just lost every standby (suppose master's ethernet cable got unplugged)? Looks to me like it has to be reliable non-replicated storage. Leaving aside the question of how reliable it can really be if not replicated, it's still the case that we have noplace to put such information given the WAL-is-across-the-whole-cluster design. regards, tom lane
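Tom's two requirements — a unique identifier per standby, and master-side state per possible standby — can be sketched as a small durable registry. This is a hypothetical illustration only (file format, names, and the crash-safe write-then-rename trick are all my invention), and it deliberately exhibits Tom's caveat: the storage is local to the master, hence not replicated:

```python
import json
import os
import tempfile

class StandbyRegistry:
    """Master-side record of per-standby sync state, keyed by a unique
    standby identifier (point #1), persisted locally so it survives a
    master restart (point #2).  As Tom notes, this is reliable only as
    far as the master's own non-replicated storage is."""

    def __init__(self, path):
        self.path = path
        self.state = {}  # standby_id -> 'sync' | 'degraded'
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def mark(self, standby_id, status):
        self.state[standby_id] = status
        # Write-then-rename so a crash cannot leave a torn state file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)

    def is_degraded(self, standby_id):
        return self.state.get(standby_id) == "degraded"
```

Note what this sketch cannot solve: if the master loses all standbys *and* then dies itself, this file is exactly the unreplicated single point of knowledge Tom is worried about.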
On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Do we really need that? Yes. But if there is no unsent WAL when the master goes down, we can start a new standby without a new backup by copying the timeline history file from the new master to the new standby and setting recovery_target_timeline to 'latest'. In this case, the new standby advances the recovery to the latest timeline ID which the new master uses, before connecting to the master. This seems to have been successful in my test environment. Though I may be missing something. > I don't think that's acceptable, we'll need to fix > that if that's the case. Agreed. > You can cross timelines with the archive, though. But IIRC there was some > issue with that too, you needed to restart the standbys because the standby > scans what timelines exist at the beginning of recovery, and won't notice > new timelines that appear after that? Yes. > We need to address that, apart from any of the other things discussed wrt. > synchronous replication. It will benefit asynchronous replication too. IMHO > *that* is the next thing we should do, the next patch we commit. You mean to commit that capability before synchronous replication? If so, I disagree with you. I think that it's not easy to address that problem. So I'm worried that implementing that capability first would mean missing sync rep in 9.1. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
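Fujii's no-new-backup recipe translates to a recovery.conf along these lines (a sketch; the host name and connection string are placeholders, and it assumes the timeline history file from the new master has already been copied into the standby's pg_xlog):

```ini
# recovery.conf on the re-pointed standby (9.0/9.1-era syntax)
standby_mode = 'on'
primary_conninfo = 'host=new-master port=5432 user=replicator'
# Follow whatever timeline the new master promoted to, instead of
# sticking to the timeline recovery started on:
recovery_target_timeline = 'latest'
```

This only works when no WAL was left unsent at failover; otherwise the standby's history has diverged and a fresh base backup is unavoidable.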
On 10/08/2010 04:11 PM, Tom Lane wrote: > Actually, #2 seems rather difficult even if you want it. Presumably > you'd like to keep that state in reliable storage, so it survives master > crashes. But how you gonna commit a change to that state, if you just > lost every standby (suppose master's ethernet cable got unplugged)? IIUC you seem to assume that the master node keeps its master role. But users who value availability a lot certainly want automatic fail-over, so any node can potentially be the new master. After recovery from a full-cluster outage, the first question is which node was the most recent master (or which former standby is up to date and could take over). Regards Markus Wanner
Tom Lane <tgl@sss.pgh.pa.us> writes: > Well, actually, that's *considerably* more complicated than just a > timeout. How are you going to "mark the standby as degraded"? The > standby can't keep that information, because it's not even connected > when the master makes the decision. ISTM that this requires > > 1. a unique identifier for each standby (not just role names that > multiple standbys might share); > > 2. state on the master associated with each possible standby -- not just > the ones currently connected. > > Both of those are perhaps possible, but the sense I have of the > discussion is that people want to avoid them. What we'd like to avoid is for the users to have to cope with such needs. Now, if that's internal to the code and automatic, that's not the same thing at all. What I'd have in mind is a "Database standby system identifier" that would be part of the initial hand shake in the replication protocol. And a system function to be able to "unregister" the standby. > Actually, #2 seems rather difficult even if you want it. Presumably > you'd like to keep that state in reliable storage, so it survives master > crashes. But how you gonna commit a change to that state, if you just > lost every standby (suppose master's ethernet cable got unplugged)? I don't see that as a huge problem myself, because I'm already well sold on the per-transaction replication-synchronous behaviour. So any change done there by the master would be hard-coded as async. What am I missing? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 10/08/2010 12:05 PM, Dimitri Fontaine wrote: > Markus Wanner <markus@bluegap.ch> writes: >> ..and a whole lot of manual work, that's prone to error for something >> that could easily be automated > > So, the master just crashed, first standby is dead and second ain't in > sync. What's the easy and automated way out? Sorry, I need a hand here. Thinking this through, I'm realizing that this can potentially work automatically with three nodes in both cases. Each node needs to keep track of whether or not it is (or became) the master - and when (lamport timestamp, maybe, not necessarily wall clock). A new master might continue to commit new transactions after a fail-over, without the old master being able to record that fact (because it's down). This means there's a different requirement after a full-cluster crash (i.e. master failure and no up-to-date standby is available). With the timeout, you absolutely need the former master to come back up again for zero data loss, no matter what your quorum_commit setting was. To be able to automatically tell who was the most recent master, you need to query the state of all other nodes, because they could be a more recent master. If that's not possible (or not feasible, because the replacement part isn't currently available), you are at risk of data loss. With the given three node scenario, the zero data loss guarantee only holds true as long as either at least one node (that is in sync) is running or if you can recover the former master after a full cluster crash. When waiting forever, you only need one of the k nodes to come back up again. You also need to query other nodes to find out which of the N nodes the k are, but being able to recover (N - k + 1) nodes is sufficient to figure that out. So any (k-1) nodes may fail, even permanently, at any point in time, and you are still not at risk of losing data. (Nor at risk of losing availability, BTW). I'm still of the opinion that that's the way easier and clearer guarantee. 
Also note that with higher values for N, this gets more and more important, because the chance to be able to recover all N nodes after a full crash shrinks with increasing N (while the time required to do so increases). But maybe the current sync rep feature doesn't need to target setups with that many nodes. I certainly agree that either way is complicated to implement. With Postgres-R, I'm clearly going the way that's able to satisfy large numbers of nodes. Thanks for an interesting discussion. And for respectful disagreement. Regards Markus Wanner
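The k-of-N reasoning in the two messages above (recovering any N - k + 1 nodes is enough to find the latest committed data, so any k - 1 permanent failures are survivable) rests on a simple quorum-intersection property. A brute-force illustration of that property - purely a sketch, not PostgreSQL code:

```python
from itertools import combinations

def quorum_intersects(n, k):
    """Check the pigeonhole property behind k-of-N quorum commit:
    every commit quorum of size k intersects every recovery set of
    size (n - k + 1), so recovering that many nodes always finds at
    least one copy of the latest committed transaction."""
    nodes = range(n)
    for commit_set in combinations(nodes, k):
        for recovery_set in combinations(nodes, n - k + 1):
            if not set(commit_set) & set(recovery_set):
                return False
    return True

# With N=5 and quorum k=3, recovering any 3 nodes suffices,
# and any 2 nodes may fail permanently without data loss.
print(quorum_intersects(5, 3))
```

The same check with a recovery set of size N - k would fail, which is exactly why one extra node beyond N - k must be recoverable.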
Markus Wanner <markus@bluegap.ch> writes: > On 10/08/2010 04:11 PM, Tom Lane wrote: >> Actually, #2 seems rather difficult even if you want it. Presumably >> you'd like to keep that state in reliable storage, so it survives master >> crashes. But how you gonna commit a change to that state, if you just >> lost every standby (suppose master's ethernet cable got unplugged)? > IIUC you seem to assume that the master node keeps its master role. But > users who value availability a lot certainly want automatic fail-over, Huh? Surely loss of the slaves shouldn't force a failover. Maybe the slaves really are all dead. regards, tom lane
On 10/08/2010 04:38 PM, Tom Lane wrote: > Markus Wanner <markus@bluegap.ch> writes: >> IIUC you seem to assume that the master node keeps its master role. But >> users who value availability a lot certainly want automatic fail-over, > > Huh? Surely loss of the slaves shouldn't force a failover. Maybe the > slaves really are all dead. I think we are talking across each other. I'm speaking about the need to be able to fail-over to a standby in case the master fails. In case of a full-cluster crash after such a fail-over, you need to take care you don't enter split brain. Some kind of STONITH, lamport clock, or what not. Figuring out which node has been the most recent (and thus most up to date) master is far from trivial. (See also my mail in answer to Dimitri a few minutes ago). Regards Markus Wanner
On Fri, 2010-10-08 at 10:11 -0400, Tom Lane wrote: > 1. a unique identifier for each standby (not just role names that > multiple standbys might share); That is difficult because each standby is identical. If a standby goes down, people can regenerate a new standby by taking a copy from another standby. What number do we give this new standby?... > 2. state on the master associated with each possible standby -- not just > the ones currently connected. > > Both of those are perhaps possible, but the sense I have of the > discussion is that people want to avoid them. Yes, I really want to avoid such issues and the likely complexities we get into trying to solve them. In reality they should not be common because they only happen if the sysadmin has not configured a sufficient number of redundant standbys. My proposed design is that the timeout does not cause the standby to be "marked as degraded". It is up to the user to decide whether they wait, or whether they progress without sync rep. Or the sysadmin can release the waiters via a function call. If the cluster does become degraded the sysadmin just generates a new standby and plugs it back into the cluster and away we go. Simple, no state to be recorded and no state to get screwed up either. I don't think we should be spending too much time trying to help people that say they want additional durability guarantees but do not match that with sufficient hardware resources to make it happen smoothly. If we do try to tackle those problems who will be able to validate that our code actually works? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Oct 8, 2010 at 5:16 PM, Markus Wanner <markus@bluegap.ch> wrote: > On 10/08/2010 05:41 AM, Fujii Masao wrote: >> But, even with quorum commit, if you choose wait-forever option, >> failover would decrease availability. Right after the failover, >> no standby has connected to new master, so if quorum >= 1, all >> the transactions must wait for a while. > > That's a point, yes. But again, this is just write-availability, you can > happily read from all active standbies. I believe many systems require write-availability. >> Basically we need to take a base backup from new master to start >> the standbys and make them connect to new master. This might take >> a long time. Since transaction commits cannot advance for that time, >> availability would goes down. > > Just don't increase your quorum_commit to unreasonable values which your > hardware cannot possible satisfy. It doesn't make sense to set a > quorum_commit of 1 or even bigger, if you don't already have a standby > attached. > > Start with 0 (i.e. replication off), then add standbies, then increase > quorum_commit to your new requirements. No. This only makes the procedure of failover more complex. >> Or you think that wait-forever option is applied only when the >> standby goes down? > > That wouldn't work in case of a full-cluster crash, where the > wait-forever option is required again. Otherwise you risk a split-brain > situation. What is a full-cluster crash? Why does it cause a split-brain? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Oct 8, 2010 at 6:00 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >From the perspective of an observer, randomly selecting a standby for > load balancing purposes: No, they are not guaranteed to see the "latest" > answer, nor even can they find out whether what they are seeing is the > latest answer. To guarantee that each standby returns the same result, we would need to use the cluster-wide snapshot to run queries. IIRC, Postgres-XC provides that feature. Though I'm not sure if it can be applied in HS/SR. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2010-10-08 at 23:55 +0900, Fujii Masao wrote: > On Fri, Oct 8, 2010 at 6:00 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > >From the perspective of an observer, randomly selecting a standby for > > load balancing purposes: No, they are not guaranteed to see the "latest" > > answer, nor even can they find out whether what they are seeing is the > > latest answer. > > To guarantee that each standby returns the same result, we would need to > use the cluster-wide snapshot to run queries. IIRC, Postgres-XC provides > that feature. Though I'm not sure if it can be applied in HS/SR. That is my understanding. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 10/08/2010 04:47 PM, Simon Riggs wrote: > Yes, I really want to avoid such issues and likely complexities we get > into trying to solve them. In reality they should not be common because > it only happens if the sysadmin has not configured sufficient number of > redundant standbys. Well, full cluster outages are infrequent, but sadly cannot be avoided entirely. (Murphy's laughing). IMO we should be prepared to deal with those. Or am I understanding you wrongly here? > I don't > think we should be spending too much time trying to help people that say > they want additional durability guarantees but do not match that with > sufficient hardware resources to make it happen smoothly. I fully agree to that statement. Regards Markus Wanner
On 10/08/2010 04:48 PM, Fujii Masao wrote: > I believe many systems require write-availability. Sure. Make sure you have enough standbies to fail over to. (I think there are even more situations where read-availability is much more important, though). >> Start with 0 (i.e. replication off), then add standbies, then increase >> quorum_commit to your new requirements. > > No. This only makes the procedure of failover more complex. Huh? This doesn't affect fail-over at all. Quite the opposite, the guarantees and requirements remain the same even after a fail-over. > What is a full-cluster crash? The event that all of your cluster nodes are down (most probably due to power failure, but fires or other catastrophic events can be other causes). Chances for that to happen can certainly be reduced by distributing to distant locations, but that equally certainly increases latency, which isn't always an option. > Why does it cause a split-brain? First master node A fails, a standby B takes over, but then fails as well. Let node C take over. Then the power aggregate catches fire, the infamous full-cluster crash (where "lights out management" gets a completely new meaning ;-) ). Split brain would be the situation that arises if all three nodes (A, B and C) start up again and think they have been the former master, so they can now continue to apply new transactions. Their data diverges, leading to what could be seen as a split-brain from the outside. Obviously, you must prevent A and B from taking the role of the master after recovery. Ideally, C would continue as the master. However, if the fire destroyed node C, let's hope you had another (sync!) standby that can act as the new master. Otherwise you've lost data. Hope that explains it. Wikipedia certainly provides a better (and less Postgres colored) explanation. Regards Markus Wanner
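One way to make "which node was the most recent master" decidable after the A-to-B-to-C failover chain above, in the spirit of the Lamport-timestamp idea raised earlier in the thread, is a monotonically increasing promotion epoch. This is an illustrative sketch with hypothetical names, not PostgreSQL code, and it assumes the newly promoted node could observe the earlier epochs (e.g. via the replication stream) before promoting - which is itself the hard part under discussion:

```python
# Guard against split-brain after a full-cluster crash: each node
# records an epoch when it becomes master; on recovery, only the node
# with the highest epoch may resume as master.

class Node:
    def __init__(self, name):
        self.name = name
        self.promotion_epoch = 0   # bumped each time this node becomes master

def promote(node, cluster):
    # The new master's epoch must exceed every epoch it has seen.
    node.promotion_epoch = max(n.promotion_epoch for n in cluster) + 1

def recover_master(cluster):
    """After a full-cluster crash, the node with the highest promotion
    epoch was the most recent master; all others must resync from it."""
    return max(cluster, key=lambda n: n.promotion_epoch)

a, b, c = Node("A"), Node("B"), Node("C")
cluster = [a, b, c]
promote(a, cluster)      # A is the initial master
promote(b, cluster)      # A fails, B takes over
promote(c, cluster)      # B fails, C takes over
print(recover_master(cluster).name)   # C must win after the crash
```

If C was destroyed in the fire, the highest surviving epoch belongs to B, and the scheme correctly reports that resuming there may lose the transactions C committed alone.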
> And, I'd like to know whether the master waits forever because of the > standby failure in other solutions such as Oracle DataGuard, MySQL > semi-synchronous replication. MySQL used to be fond of simply failing silently. Not sure what 5.4 does, or Oracle. In any case MySQL's replication has always really been async (except Cluster, which is a very different database), so it's not really a comparison. Here are the comparables: Oracle DataGuard DRBD SQL Server DB2 If anyone knows what the above do by default, please speak up! -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
* On 10/8/10, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Do we really need that? > > Yes. But if there is no unsent WAL when the master goes down, > we can start new standby without new backup by copying the > timeline history file from new master to new standby and > setting recovery_target_timeline to 'latest'. In this case, > new standby advances the recovery to the latest timeline ID > which new master uses before connecting to the master. > > This seems to have been successful in my test environment. > Though I'm missing something. > >> I don't think that's acceptable, we'll need to fix >> that if that's the case. > > Agreed. > >> You can cross timelines with the archive, though. But IIRC there was some >> issue with that too, you needed to restart the standbys because the >> standby >> scans what timelines exist at the beginning of recovery, and won't notice >> new timelines that appear after that? > > Yes. > >> We need to address that, apart from any of the other things discussed wrt. >> synchronous replication. It will benefit asynchronous replication too. >> IMHO >> *that* is the next thing we should do, the next patch we commit. > > You mean to commit that capability before synchronous replication? If so, > I disagree with you. I think that it's not easy to address that problem. > So I'm worried about that implementing that capability first means the miss > of sync rep in 9.1. > > Regards, > > -- > Fujii Masao > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center -- Rob Wultsch wultsch@gmail.com
On 08.10.2010 17:26, Fujii Masao wrote: > On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Do we really need that? > > Yes. But if there is no unsent WAL when the master goes down, > we can start new standby without new backup by copying the > timeline history file from new master to new standby and > setting recovery_target_timeline to 'latest'. .. and restart the standby. > In this case, > new standby advances the recovery to the latest timeline ID > which new master uses before connecting to the master. > > This seems to have been successful in my test environment. > Though I'm missing something. Yeah, that should work, but it's awfully complicated. >> I don't think that's acceptable, we'll need to fix >> that if that's the case. > > Agreed. > >> You can cross timelines with the archive, though. But IIRC there was some >> issue with that too, you needed to restart the standbys because the standby >> scans what timelines exist at the beginning of recovery, and won't notice >> new timelines that appear after that? > > Yes. > >> We need to address that, apart from any of the other things discussed wrt. >> synchronous replication. It will benefit asynchronous replication too. IMHO >> *that* is the next thing we should do, the next patch we commit. > > You mean to commit that capability before synchronous replication? If so, > I disagree with you. I think that it's not easy to address that problem. > So I'm worried about that implementing that capability first means the miss > of sync rep in 9.1. It's a pretty severe shortcoming at the moment. For starters, it means that you need a shared archive, even if you set wal_keep_segments to a high number. Secondly, it's a lot of scripting to get it working, I don't like the thought of testing failovers in synchronous replication if I have to do all that. Frankly, this seems more important to me than synchronous replication. It shouldn't be too hard to fix. 
Walsender needs to be able to read WAL from preceding timelines, like recovery does, and walreceiver needs to write the incoming WAL to the right file. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Markus Wanner wrote: > ..and how do you make sure you are not marking your second standby as > degraded just because it's currently lagging? Effectively degrading the > utterly needed one, because your first standby has just bitten the dust? People are going to monitor the standby lag. If it gets excessive and approaches the known timeout, the flashing yellow lights should go off at this point, before it gets this bad. And if you've set a reasonable business oriented timeout on how long you can stand for the master to be held up waiting for a lagging standby, the right thing to do may very well be to cut it off. At some point people will want to stop waiting for a standby if it's taking so long to commit that it's interfering with the ability of the master to operate normally. Such a master is already degraded, if your performance metrics for availability include processing transactions in a timely manner. > And how do you prevent the split brain situation in case the master dies > shortly after these events, but fails to come up again immediately? How is that a new problem? It's already possible to end up with a standby pair that has suffered through some bizarre failure chain such that it's not necessarily obvious which of the two systems has the most recent set of data on it. And that's not this project's problem to solve. Useful answers to the split brain problem involve fencing implementations that normally drop to the hardware level, and clustering solutions including those features are already available that PostgreSQL can integrate into. Assuming you have to solve this in order to deliver a useful database replication component is excessively ambitious. You seem to be under the assumption that a more complicated replication implementation here will make reaching a bad state impossible. I think that's optimistic, both in theory and in regards to how successful code gets built. 
Here's the thing: the difficulty of testing to prove your code actually works is also proportional to that complexity. This project can choose to commit and potentially ship a simple solution that has known limitations, and expect that people will fill in the gap with existing add-on software to handle the clustering parts it doesn't: fencing, virtual IP address assignment, etc. All while getting useful testing feedback on the simple bottom layer, whose main purpose in life is to transport WAL data synchronously. Or, we can argue in favor of adding additional complexity on top first instead, so we end up with layers and layers of untested code. That path leads to situations where you're lucky to ship at all, and when you do the result is difficult to support. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
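The monitoring policy Greg describes above (watch standby lag, raise the flashing yellow lights well before lag reaches the timeout at which the master cuts the standby off) can be condensed into a tiny classifier. A hedged sketch only - the threshold and function names are hypothetical, not anything PostgreSQL ships:

```python
WARN_FRACTION = 0.8   # alert when lag reaches 80% of the timeout

def lag_status(lag_ms, timeout_ms):
    """Classify standby lag relative to the sync-rep timeout."""
    if timeout_ms == 0:           # 0 = wait forever: lag never degrades
        return "ok"
    if lag_ms >= timeout_ms:
        return "degraded"         # master would stop waiting for this standby
    if lag_ms >= WARN_FRACTION * timeout_ms:
        return "warning"          # act now, before commits start stalling
    return "ok"

print(lag_status(500, 10000))    # ok
print(lag_status(8500, 10000))   # warning
print(lag_status(12000, 10000))  # degraded
```

In practice the lag input would come from comparing the master's current WAL position against the standby's reported receive/replay position.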
Tom Lane wrote: > How are you going to "mark the standby as degraded"? The > standby can't keep that information, because it's not even connected > when the master makes the decision. From a high level, I'm assuming only that the master has a list in memory of the standby system(s) it believes are up to date, and that it is supposed to commit to synchronously. When I say mark as degraded, I mean that the master merely closes whatever communications channel it had open with that system and removes the standby from that list. If that standby now reconnects again, I don't see how resolving what happens at that point is any different from when a standby is first started after both systems were turned off. If the standby is current with the data available on the master when it has an initial conversation, great; it's now available for synchronous commit too then. If it's not, it goes into a catchup mode first instead. When the master sees you're back to current again, if you're on the list of sync servers too you go back onto the list of active sync systems. There shouldn't be any state information to save here. If the master and standby can't figure out if they are in or out of sync with one another based on the conversation they have when they first connect to one another, that suggests to me there need to be improvements made in the communications protocol they use to exchange messages. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
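The state Greg sketches above - an in-memory list only, "degrading" a standby means forgetting it, and a reconnecting standby re-enters via catchup with no durable state anywhere - amounts to a very small state machine. An illustrative sketch with hypothetical names, not actual server code:

```python
class Master:
    """The master tracks only which configured sync standbys are
    currently connected and caught up. Nothing is persisted."""

    def __init__(self, sync_names):
        self.sync_names = set(sync_names)   # configured sync standbys
        self.active = set()                 # currently caught-up ones

    def standby_caught_up(self, name):
        if name in self.sync_names:
            self.active.add(name)           # eligible for sync commit again

    def standby_lost(self, name):
        self.active.discard(name)           # "mark as degraded" = forget it

    def can_commit_sync(self):
        return bool(self.active)

m = Master({"s1", "s2"})
m.standby_caught_up("s1")
print(m.can_commit_sync())       # True
m.standby_lost("s1")
print(m.can_commit_sync())       # False: nobody to wait for
m.standby_caught_up("s1")        # reconnect + catchup restores eligibility
print(m.can_commit_sync())       # True
```

Note that a standby not in the configured set never becomes active, matching Greg's point that membership is decided in the initial conversation, not from saved state.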
On Fri, 2010-10-08 at 17:06 +0200, Markus Wanner wrote: > Well, full cluster outages are infrequent, but sadly cannot be avoided > entirely. (Murphy's laughing). IMO we should be prepared to deal with > those. I've described how I propose to deal with those. I'm not waving away these issues, just proposing that we consciously choose simplicity and therefore robustness. Let me say it again for clarity. (This is written for the general case, though my patch uses only k=1 i.e. one acknowledgement): If we want robustness, we have multiple standbys. So if you lose one, you continue as normal without interruption. That is the first and most important line of defence - not software. When we start to wait, if there aren't sufficient active standbys to acknowledge a commit, then the commit won't wait. This behaviour helps us avoid situations where we are hours or days away from having a working standby to acknowledge the commit. We've had a long debate about servers that "ought to be there" but aren't; I suggest we treat standbys that aren't there as having a strong possibility they won't come back, and hence not worth waiting for. Heikki disagrees; I have no problem with adding server registration so that we can add additional waits, but I doubt that the majority of users prefer waiting over availability. It can be an option Once we are waiting, if insufficient standbys acknowledge the commit we will wait until the timeout expires, after which we commit and continue working. If you don't like timeouts, set the timeout to 0 to wait forever. This behaviour is designed to emphasise availability. (I acknowledge that some people are so worried by data loss that they would choose to stop changes altogether, and accept unavailability; I regard that as a minority use case, but one which I would not argue against including as an options at some point in the future.) 
To cover Dimitri's observation that when a streaming standby first connects it might take some time before it can sensibly acknowledge, we don't activate the standby until it has caught up. Once caught up, it will advertise its capability to offer a sync rep service. Standbys that don't wish to be failover targets can set synchronous_replication_service = off. The paths between servers aren't defined explicitly, so the parameters all still work even after failover. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
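Simon's rules in the message above (don't start waiting if nobody can acknowledge; otherwise wait for an ack up to the timeout; timeout 0 means wait forever; his patch uses k=1, i.e. one acknowledgement) can be condensed into a toy decision function. This is a simulation of the policy for illustration, not walsender code, and the parameter names are hypothetical:

```python
def commit_wait(active_standbys, ack_after_ms, timeout_ms):
    """Return how the commit completes: 'no-wait' if nobody can ack,
    'acked' if an ack arrives in time, 'timed-out' otherwise."""
    if not active_standbys:
        return "no-wait"          # avoid waiting hours for a standby to appear
    if timeout_ms == 0:           # 0 = wait forever
        return "acked"
    if ack_after_ms <= timeout_ms:
        return "acked"
    return "timed-out"            # commit anyway and continue working

print(commit_wait([], ack_after_ms=50, timeout_ms=1000))       # no-wait
print(commit_wait(["s1"], ack_after_ms=50, timeout_ms=1000))   # acked
print(commit_wait(["s1"], ack_after_ms=5000, timeout_ms=1000)) # timed-out
```

The "no-wait" branch is exactly the availability-first choice debated in this thread: a standby that ought to be there but isn't is treated as possibly never coming back.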
On Fri, 2010-10-08 at 16:34 -0400, Greg Smith wrote: > Tom Lane wrote: > > How are you going to "mark the standby as degraded"? The > > standby can't keep that information, because it's not even connected > > when the master makes the decision. > > From a high level, I'm assuming only that the master has a list in > memory of the standby system(s) it believes are up to date, and that it > is supposed to commit to synchronously. When I say mark as degraded, I > mean that the master merely closes whatever communications channel it > had open with that system and removes the standby from that list. My current coding works with two sets of parameters: The "master marks standby as degraded" is handled by the tcp keepalives. When it notices no response, it kicks out the standby. We already had this, so I never mentioned it before as being part of the solution. The second part is the synchronous_replication_timeout which is a user settable parameter defining how long the app is prepared to wait, which could be more or less time than the keepalives. > If that standby now reconnects again, I don't see how resolving what > happens at that point is any different from when a standby is first > started after both systems were turned off. If the standby is current > with the data available on the master when it has an initial > conversation, great; it's now available for synchronous commit too > then. If it's not, it goes into a catchup mode first instead. When the > master sees you're back to current again, if you're on the list of sync > servers too you go back onto the list of active sync systems. > > There's shouldn't be any state information to save here. If the master > and standby can't figure out if they are in or out of sync with one > another based on the conversation they have when they first connect to > one another, that suggests to me there needs to be improvements made in > the communications protocol they use to exchange messages. Agreed. 
-- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Greg, to me it looks like we have very similar goals, but start from different preconditions. I absolutely agree with you given the preconditions you named. On 10/08/2010 10:04 PM, Greg Smith wrote: > How is that a new problem? It's already possible to end up with a > standby pair that has suffered through some bizarre failure chain such > that it's not necessarily obvious which of the two systems has the most > recent set of data on it. And that's not this project's problem to > solve. Thanks for pointing that out. I think that might not have been clear to me. This limitation of scope certainly make sense for the Postgres project in general. Regards Markus Wanner
On Sat, Oct 9, 2010 at 12:12 AM, Markus Wanner <markus@bluegap.ch> wrote: > On 10/08/2010 04:48 PM, Fujii Masao wrote: >> I believe many systems require write-availability. > > Sure. Make sure you have enough standbies to fail over to. Unfortunately even enough standbys don't increase write-availability unless you choose wait-forever. Because, after promoting one of standbys to new master, you must keep all the transactions waiting until at least one standby has connected to and caught up with new master. Currently this wait time is not short. > (I think there are even more situations where read-availability is much > more important, though). Even so, we should not ignore the write-availability aspect. >>> Start with 0 (i.e. replication off), then add standbies, then increase >>> quorum_commit to your new requirements. >> >> No. This only makes the procedure of failover more complex. > > Huh? This doesn't affect fail-over at all. Quite the opposite, the > guarantees and requirements remain the same even after a fail-over. Hmm.. that increases the number of procedures which the users must perform at the failover. At least, the users seem to have to wait until the standby has caught up with new master, increase quorum_commit and then reload the configuration file. >> What is a full-cluster crash? > > The event that all of your cluster nodes are down (most probably due to > power failure, but fires or other catastrophic events can be other > causes). Chances for that to happen can certainly be reduced by > distributing to distant locations, but that equally certainly increases > latency, which isn't always an option. Yep. >> Why does it cause a split-brain? > > First master node A fails, a standby B takes over, but then fails as > well. Let node C take over. Then the power aggregates catches fire, the > infamous full-cluster crash (where "lights out management" gets a > completely new meaning ;-) ). 
> > Split brain would be the situation that arises if all three nodes (A, B > and C) start up again and think they have been the former master, so > they can now continue to apply new transactions. Their data diverges, > leading to what could be seen as a split-brain from the outside. > > Obviously, you must disallow A and B to take the role of the master > after recovery. Yep. Something like STONITH would be required. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sat, Oct 9, 2010 at 1:41 AM, Josh Berkus <josh@agliodbs.com> wrote: > >> And, I'd like to know whether the master waits forever because of the >> standby failure in other solutions such as Oracle DataGuard, MySQL >> semi-synchronous replication. > > MySQL used to be fond of simply failing silently. Not sure what 5.4 does, > or Oracle. In any case MySQL's replication has always really been async > (except Cluster, which is a very different database), so it's not really a > comparison. IIRC, MySQL *semi-synchronous* replication is not async, so it can be a comparison. Of course, MySQL's default replication is async, though. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> Yes. But if there is no unsent WAL when the master goes down, >> we can start new standby without new backup by copying the >> timeline history file from new master to new standby and >> setting recovery_target_timeline to 'latest'. > > .. and restart the standby. Yes. > It's a pretty severe shortcoming at the moment. For starters, it means that > you need a shared archive, even if you set wal_keep_segments to a high > number. Secondly, it's a lot of scripting to get it working, I don't like > the thought of testing failovers in synchronous replication if I have to do > all that. Frankly, this seems more important to me than synchronous > replication. There seems to be a difference in outlook between us. I prefer sync rep. But I'm OK with addressing that first if it's not hard. > It shouldn't be too hard to fix. Walsender needs to be able to read WAL from > preceding timelines, like recovery does, and walreceiver needs to write the > incoming WAL to the right file. And walsender seems to need to transfer the current timeline history to the standby. Otherwise, the standby cannot recover the WAL file with new timeline. And the standby might need to create the timeline history file in order to recover the WAL file with new timeline even after it's restarted. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 13.10.2010 08:21, Fujii Masao wrote: > On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> It shouldn't be too hard to fix. Walsender needs to be able to read WAL from >> preceding timelines, like recovery does, and walreceiver needs to write the >> incoming WAL to the right file. > > And walsender seems to need to transfer the current timeline history to > the standby. Otherwise, the standby cannot recover the WAL file with new > timeline. And the standby might need to create the timeline history file > in order to recover the WAL file with new timeline even after it's restarted. Yes, true, you need that too. It might be good to divide this work into two phases, teaching archive recovery to notice new timelines appearing in the archive first, and doing the walsender/walreceiver changes after that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Oct 13, 2010 at 2:43 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 13.10.2010 08:21, Fujii Masao wrote: >> >> On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> >>> It shouldn't be too hard to fix. Walsender needs to be able to read WAL >>> from >>> preceding timelines, like recovery does, and walreceiver needs to write >>> the >>> incoming WAL to the right file. >> >> And walsender seems to need to transfer the current timeline history to >> the standby. Otherwise, the standby cannot recover the WAL file with new >> timeline. And the standby might need to create the timeline history file >> in order to recover the WAL file with new timeline even after it's >> restarted. > > Yes, true, you need that too. > > It might be good to divide this work into two phases, teaching archive > recovery to notice new timelines appearing in the archive first, and doing > the walsender/walreceiver changes after that. There's another problem here we should think about, too. Suppose you have a master and two standbys. The master dies. You promote one of the standbys, which turns out to be behind the other. You then repoint the other standby at the one you promoted. Congratulations, your database is now very possibly corrupt, and you may very well get no warning of that fact. It seems to me that we would be well-advised to install some kind of bullet-proof safeguard against this kind of problem, so that you will KNOW that the standby needs to be re-synced. I mention this because I have a vague feeling that timelines are supposed to prevent you from getting different WAL histories confused with each other, but they don't actually cover all the cases that can happen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
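The core of the safeguard Robert asks for is a single comparison: before re-pointing a standby at a newly promoted master, check whether the standby's replay position is at or behind the LSN where the new master's timeline forked off. If the standby replayed past the fork point, it holds WAL the new master never had and must be re-synced, not silently reconnected. A purely illustrative sketch (LSNs simplified to integers, names hypothetical):

```python
def safe_to_follow(standby_replay_lsn, fork_lsn):
    """A standby that replayed past the promotion (fork) point has
    diverged from the new master's history: reconnecting it without a
    re-sync risks silent corruption."""
    return standby_replay_lsn <= fork_lsn

fork = 0x5000   # LSN at which the promoted standby started its new timeline

print(safe_to_follow(0x4800, fork))   # True: behind the fork, can catch up
print(safe_to_follow(0x5a00, fork))   # False: ahead of the fork, must re-sync
```

The second case is exactly Robert's scenario: the standby that was ahead of the one you promoted fails this check, and the system should refuse to stream to it rather than proceed without warning.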
On 10/13/2010 06:43 AM, Fujii Masao wrote:
> Unfortunately even enough standbys don't increase write-availability
> unless you choose wait-forever. Because, after promoting one of
> standbys to new master, you must keep all the transactions waiting
> until at least one standby has connected to and caught up with new
> master. Currently this wait time is not short.

Why is that? Don't the standbys just have to switch from one walsender to
another? If there's any significant delay in switching, this hurts either
availability or robustness, yes.

> Hmm.. that increases the number of procedures which the users must
> perform at the failover.

I only consider fully automated failover. However, you seem to be worried
about the initial setup of sync rep.

> At least, the users seem to have to wait
> until the standby has caught up with new master, increase quorum_commit
> and then reload the configuration file.

For switching from a single node to a sync replication setup with one or
more standbys, that seems reasonable. There are many more components you
need to set up or adjust in such a case (network, load balancer, alerting
system, and maybe even the application itself). There's really no other
option if you want the kind of robustness guarantee that sync rep with
wait-forever provides.

OTOH, if you just replicate to whatever standby is there and don't care
much if it isn't, the admin doesn't need to worry much about quorum_commit
- it doesn't have much of an effect anyway.

Regards

Markus Wanner
On Wed, Oct 13, 2010 at 3:43 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 13.10.2010 08:21, Fujii Masao wrote:
>> On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> It shouldn't be too hard to fix. Walsender needs to be able to read WAL
>>> from preceding timelines, like recovery does, and walreceiver needs to
>>> write the incoming WAL to the right file.
>>
>> And walsender seems to need to transfer the current timeline history to
>> the standby. Otherwise, the standby cannot recover the WAL file with the
>> new timeline. And the standby might need to create the timeline history
>> file in order to recover the WAL file with the new timeline even after
>> it's restarted.
>
> Yes, true, you need that too.
>
> It might be good to divide this work into two phases, teaching archive
> recovery to notice new timelines appearing in the archive first, and
> doing the walsender/walreceiver changes after that.

OK. In detail:

1. After failover, when the standby connects to the new master, walsender
   transfers the current timeline history in the handshake processing.
2. If the timeline history in the master is inconsistent with that in the
   standby, walreceiver terminates the replication connection.
3. Walreceiver creates the timeline history file.
4. Walreceiver signals the change of timeline history to the startup
   process and makes it read the timeline history file. After this, the
   startup process tries to recover the WAL files even with the new
   timeline ID.
5. After the handshake, walsender sends the WAL from preceding timelines,
   like recovery does, and walreceiver writes the incoming WAL to the
   right file.

Am I missing something?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
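[Editorial note: a minimal sketch of the history check in steps 1-2 above,
in Python for illustration only. The data layout and function names are
hypothetical, not PostgreSQL's actual internals; a history is modeled as a
list of (timeline_id, switch_lsn) entries, oldest first.]

```python
def histories_consistent(master_history, standby_history):
    """The standby may follow the master only if the standby's timeline
    history is a prefix of the master's, i.e. the master's current timeline
    forked from a point the standby has also passed through."""
    if len(standby_history) > len(master_history):
        return False
    return master_history[:len(standby_history)] == standby_history

# The master switched to timeline 2 at LSN 0x3000000; a standby still on
# timeline 1 shares that history, so streaming can proceed:
master = [(1, 0x3000000), (2, None)]    # None: current, open-ended timeline
standby = [(1, 0x3000000)]
print(histories_consistent(master, standby))   # True

# A standby that forked onto its own timeline 3 must be rejected (step 2):
diverged = [(1, 0x2000000), (3, None)]
print(histories_consistent(master, diverged))  # False
```

Under this model, step 2's "inconsistent" simply means "not a prefix"; the
real mechanism would of course compare switch LSNs per timeline as well.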
On Wed, Oct 13, 2010 at 3:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> There's another problem here we should think about, too. Suppose you
> have a master and two standbys. The master dies. You promote one of
> the standbys, which turns out to be behind the other. You then
> repoint the other standby at the one you promoted. Congratulations,
> your database is now very possibly corrupt, and you may very well get
> no warning of that fact. It seems to me that we would be well-advised
> to install some kind of bullet-proof safeguard against this kind of
> problem, so that you will KNOW that the standby needs to be re-synced.

Yep. This is why I said it's not easy to implement that.

To start the standby without taking a base backup from the new master after
failover, the user basically has to promote the standby which is ahead of
the other standbys (e.g., by comparing pg_last_xlog_replay_location on each
standby).

As the safeguard, we seem to need to compare the location at the switch of
the timeline on the master with the last replay location on the standby. If
the latter location is ahead AND the timeline ID of the standby is not the
same as that of the master, we should emit a warning and terminate the
replication connection.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
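[Editorial note: the safeguard Fujii describes, sketched in Python for
illustration. LSNs are modeled as plain integers and the names are
hypothetical, not the actual server code.]

```python
def replication_allowed(master_tli, standby_tli, switch_lsn, standby_replay_lsn):
    """Refuse to stream if the standby has replayed past the LSN at which
    the master switched to its current timeline while still being on a
    different (older) timeline: the standby then holds WAL the new master
    never generated, so it needs a re-sync."""
    if standby_tli == master_tli:
        return True                      # same history, nothing to check
    if standby_replay_lsn > switch_lsn:
        return False                     # standby is ahead of the fork point
    return True

# Standby replayed to 0x500, but the master forked to timeline 2 at 0x400:
print(replication_allowed(2, 1, 0x400, 0x500))  # False: warn and disconnect
# Standby only replayed to 0x300, so it can still catch up cleanly:
print(replication_allowed(2, 1, 0x400, 0x300))  # True
```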
On Tue, Oct 12, 2010 at 11:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> There's another problem here we should think about, too. Suppose you
> have a master and two standbys. The master dies. You promote one of
> the standbys, which turns out to be behind the other. You then
> repoint the other standby at the one you promoted. Congratulations,
> your database is now very possibly corrupt, and you may very well get
> no warning of that fact. It seems to me that we would be well-advised
> to install some kind of bullet-proof safeguard against this kind of
> problem, so that you will KNOW that the standby needs to be re-synced.
> I mention this because I have a vague feeling that timelines are
> supposed to prevent you from getting different WAL histories confused
> with each other, but they don't actually cover all the cases that can
> happen.

Why don't the usual protections kick in here? The new record, read from the
location where the xlog reader is expecting to find it, has to have a valid
CRC and a correct back-pointer to the previous record. If the new walsender
is behind the old one then the new record it's sent won't match up at all.

--
greg
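[Editorial note: the "usual protections" Greg refers to, sketched as a toy
WAL record format: a CRC over the record body plus a back-pointer to the
previous record's LSN. This is an illustrative model, not the real xlog
record layout.]

```python
import zlib

def make_record(prev_lsn, payload):
    # Toy record: 4-byte CRC | 8-byte back-pointer to previous LSN | payload.
    body = prev_lsn.to_bytes(8, "big") + payload
    return zlib.crc32(body).to_bytes(4, "big") + body

def record_valid(record, expected_prev_lsn):
    crc, body = record[:4], record[4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return False                     # bit corruption: caught by the CRC
    # A record from a diverged history points at a different previous LSN:
    return int.from_bytes(body[:8], "big") == expected_prev_lsn

rec = make_record(0x1000, b"commit t1")
print(record_valid(rec, 0x1000))  # True: chains onto what we last replayed
print(record_valid(rec, 0x2000))  # False: back-pointer doesn't match
```

As the thread goes on to note, these checks catch most mismatches but are
not an airtight guarantee against every divergence scenario.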
On Wed, Oct 13, 2010 at 5:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Oct 13, 2010 at 3:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> There's another problem here we should think about, too. Suppose you
>> have a master and two standbys. The master dies. You promote one of
>> the standbys, which turns out to be behind the other. You then
>> repoint the other standby at the one you promoted. Congratulations,
>> your database is now very possibly corrupt, and you may very well get
>> no warning of that fact. It seems to me that we would be well-advised
>> to install some kind of bullet-proof safeguard against this kind of
>> problem, so that you will KNOW that the standby needs to be re-synced.
>
> Yep. This is why I said it's not easy to implement that.
>
> To start the standby without taking a base backup from the new master
> after failover, the user basically has to promote the standby which is
> ahead of the other standbys (e.g., by comparing
> pg_last_xlog_replay_location on each standby).
>
> As the safeguard, we seem to need to compare the location at the switch
> of the timeline on the master with the last replay location on the
> standby. If the latter location is ahead AND the timeline ID of the
> standby is not the same as that of the master, we should emit a warning
> and terminate the replication connection.

That doesn't seem very bullet-proof. You can accidentally corrupt a standby
even when only one timeline is involved. AFAIK, stopping a standby,
removing recovery.conf, and starting it up again does not change timelines.
You can even shut down the standby, bring it up as a master, generate a
little WAL, shut it back down, and bring it back up as a standby pointing
to the same master. It would be nice to embed in each checkpoint record an
identifier that changes randomly on each transition to normal running, so
that if you do something like this we can notice and complain loudly.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
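[Editorial note: a toy model of Robert's suggestion above: stamp each
transition to normal running with a fresh random identifier (recorded,
conceptually, in the next checkpoint record), so a standby whose remembered
identifier no longer matches the master's knows the histories diverged.
Purely illustrative; the class and method names are invented.]

```python
import os

class ToyServer:
    def __init__(self):
        self.run_id = None

    def enter_normal_running(self):
        # Every entry into normal (writable) operation gets a new random
        # token, which would be written into subsequent checkpoint records.
        self.run_id = os.urandom(8)

master = ToyServer()
master.enter_normal_running()
remembered = master.run_id           # what a connected standby would record

master.enter_normal_running()        # e.g. briefly brought up as a master
print(remembered == master.run_id)   # False: the standby can complain loudly
```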
On Thu, Oct 14, 2010 at 11:18 AM, Greg Stark <gsstark@mit.edu> wrote:
> Why don't the usual protections kick in here? The new record read from
> the location the xlog reader is expecting to find it has to have a
> valid CRC and a correct back pointer to the previous record.

Yep. In most cases, those protections seem to be able to make the standby
notice the inconsistency of the WAL and then give up continuing
replication. But not in all cases. Can we regard those protections as a
bullet-proof safeguard?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Oct 13, 2010 at 10:18 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Tue, Oct 12, 2010 at 11:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> There's another problem here we should think about, too. Suppose you
>> have a master and two standbys. The master dies. You promote one of
>> the standbys, which turns out to be behind the other. You then
>> repoint the other standby at the one you promoted. Congratulations,
>> your database is now very possibly corrupt, and you may very well get
>> no warning of that fact. It seems to me that we would be well-advised
>> to install some kind of bullet-proof safeguard against this kind of
>> problem, so that you will KNOW that the standby needs to be re-synced.
>> I mention this because I have a vague feeling that timelines are
>> supposed to prevent you from getting different WAL histories confused
>> with each other, but they don't actually cover all the cases that can
>> happen.
>
> Why don't the usual protections kick in here? The new record read from
> the location the xlog reader is expecting to find it has to have a
> valid CRC and a correct back pointer to the previous record. If the
> new wal sender is behind the old one then the new record it's sent
> won't match up at all.

There's some kind of logic that rewinds to the beginning of the WAL segment
and tries to replay from there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tom Lane wrote:
> Greg Smith <greg@2ndquadrant.com> writes:
> > I don't see this as needing any implementation any more complicated
> > than the usual way such timeouts are handled. Note how long you've
> > been trying to reach the standby. Default to -1 for forever. And if
> > you hit the timeout, mark the standby as degraded and force them to do
> > a proper resync when they disconnect. Once that's done, then they can
> > re-enter sync rep mode again, via the same process a new node would
> > have done so.
>
> Well, actually, that's *considerably* more complicated than just a
> timeout. How are you going to "mark the standby as degraded"? The
> standby can't keep that information, because it's not even connected
> when the master makes the decision. ISTM that this requires
>
> 1. a unique identifier for each standby (not just role names that
> multiple standbys might share);
>
> 2. state on the master associated with each possible standby -- not
> just the ones currently connected.
>
> Both of those are perhaps possible, but the sense I have of the
> discussion is that people want to avoid them.
>
> Actually, #2 seems rather difficult even if you want it. Presumably
> you'd like to keep that state in reliable storage, so it survives
> master crashes. But how are you gonna commit a change to that state, if
> you just lost every standby (suppose the master's ethernet cable got
> unplugged)? Looks to me like it has to be reliable non-replicated
> storage. Leaving aside the question of how reliable it can really be if
> not replicated, it's still the case that we have no place to put such
> information given the WAL-is-across-the-whole-cluster design.

I assumed we would have a parameter called "sync_rep_failure" that would
take a command, and the command would be called when communication to the
slave was lost. If you restart, it tries again and might call the function
again.
--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
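[Editorial note: a sketch of the timeout scheme in the quoted exchange,
with Bruce's hypothetical "sync_rep_failure" hook modeled as a plain
callable, and -1 meaning wait forever as in Greg Smith's proposal.
Illustrative only; none of these names exist in PostgreSQL.]

```python
import time

def wait_for_standby(ack_received, timeout_s, on_failure):
    """Wait for the standby's acknowledgement of a commit. On timeout,
    invoke the user-supplied failure hook (the hypothetical
    sync_rep_failure command) and mark the standby degraded; a degraded
    standby must re-sync before re-entering sync rep."""
    deadline = None if timeout_s < 0 else time.monotonic() + timeout_s
    while not ack_received():
        if deadline is not None and time.monotonic() >= deadline:
            on_failure()                 # e.g. run the configured command
            return "degraded"
        time.sleep(0.01)
    return "committed"

events = []
print(wait_for_standby(lambda: True, 5, events.append))   # acked in time
print(wait_for_standby(lambda: False, 0, lambda: events.append("fail")))
print(events)  # the failure hook ran exactly once: ['fail']
```

This sidesteps none of Tom's objections (where the degraded mark is stored,
and how it gets committed when every standby is gone); it only shows the
control flow being debated.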