Thread: Synchronization levels in SR
Hi,

I'm now designing the "synchronous" replication feature based on SR for 9.1, while discussing it in another thread.
http://archives.postgresql.org/pgsql-hackers/2010-04/msg01516.php

At the first design phase, I'd like to clarify which synch levels should be supported in 9.1 and how it should be specified by users.

The log-shipping replication has some synch levels as follows.

The transaction commit on the master
#1 doesn't wait for replication (already supported in 9.0)
#2 waits for WAL to be received by the standby
#3 waits for WAL to be received and flushed by the standby
#4 waits for WAL to be received, flushed and replayed by
the standby
..etc?

Which should we include in 9.1? I'd like to add #2 and #3. They are enough for the high-availability use case (i.e., to prevent failover from losing any committed transactions). AFAIR, MySQL semi-synchronous replication supports the #2 level.

#4 is useful for some cases, but might often make the transaction commit on the master get stuck, since a read-only query can easily block recovery via a lock conflict. So #4 seems not worth working on until that HS problem has been addressed. Thoughts?

Second, we need to discuss how to specify the synch level. There are three approaches:

* Per standby
Since the purpose, location and H/W resources often differ from one standby to another, specifying the level per standby (i.e., we set the level in recovery.conf) is a straightforward approach, I think. For example, we can choose #3 for a high-availability standby near the master, and choose #1 (async) for a remote disaster-recovery standby.

* Per transaction
Define a PGC_USERSET option specifying the level, and set it on the master according to the purpose of the transaction. In this approach, for example, we can choose #4 for a transaction which should be visible on the standby as soon as a "success" of the commit has been returned to a client. We can also choose #1 for a time-critical but not mission-critical transaction.

* Mix
Allow users to specify the level per standby and per transaction at the same time, and then calculate the real level from them by using some algorithm.

Which should we adopt for 9.1? I'd like to implement the "per-standby" approach first since it's simple and seems to cover more use cases. Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
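For concreteness, a rough sketch of how the three approaches could be expressed in configuration. The parameter name and its values here are hypothetical illustrations, not proposed syntax:

    # Per standby: each standby declares its own level in recovery.conf
    replication_sync_level = 'fsync'      # async | recv | fsync | replay

    # Per transaction: a PGC_USERSET GUC, set on the master per session
    SET replication_sync_level = 'replay';

    # Mix: both are set, and the effective level for each
    # (standby, transaction) pair is derived from them,
    # e.g. as the stricter of the two.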
On 24/05/10 16:20, Fujii Masao wrote:
> The log-shipping replication has some synch levels as follows.
>
> The transaction commit on the master
> #1 doesn't wait for replication (already supported in 9.0)
> #2 waits for WAL to be received by the standby
> #3 waits for WAL to be received and flushed by the standby
> #4 waits for WAL to be received, flushed and replayed by
> the standby
> ..etc?
>
> Which should we include in 9.1? I'd like to add #2 and #3.
> They are enough for the high-availability use case (i.e., to
> prevent failover from losing any committed transactions).
> AFAIR, MySQL semi-synchronous replication supports the #2 level.
>
> #4 is useful for some cases, but might often make the
> transaction commit on the master get stuck, since a read-only
> query can easily block recovery via a lock conflict. So
> #4 seems not worth working on until that HS problem
> has been addressed. Thoughts?

I see a lot of value in #4; it makes it possible to distribute read-only load to the standby using something like pgbouncer, completely transparently to the application. In the lesser modes, the application can see slightly stale results.

But whatever we can easily implement, really. Pick the one that you think is the easiest and start with that, but keep the other modes in mind in the design and in the user interface so that you don't paint yourself into a corner.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
> #4 is useful for some cases, but might often make the
> transaction commit on the master get stuck, since a read-only
> query can easily block recovery via a lock conflict. So
> #4 seems not worth working on until that HS problem
> has been addressed. Thoughts?

I agree that #4 should be done last, but it will be needed, not in the least by your employer ;-) . I don't see any obvious way to make #4 compatible with any significant query load on the slave, but in general I'd think that users of #4 are far more concerned with 0% data loss than they are with getting the slave to run read queries.

> Second, we need to discuss how to specify the synch
> level. There are three approaches:
>
> * Per standby
>
> * Per transaction

Ach, I'm torn. I can see strong use cases for both of the above. Really, I think:

> * Mix
> Allow users to specify the level per standby and per
> transaction at the same time, and then calculate the real
> level from them by using some algorithm.

What we should do is specify it per-standby, and then have a USERSET GUC on the master which specifies which transactions will be synched; those will be synched only on the slaves which are set up to support synch. That is, if you have:

Master
Slave #1: synch
Slave #2: not synch
Slave #3: not synch

And you have:

Session #1: synch
Session #2: not synch

Session #1 will be synched on Slave #1 before commit. Nothing will be synched on Slaves #2 and #3, and Session #2 will not wait for synch on any slave.

I think this model delivers the maximum HA flexibility to users while still making intuitive logical sense.

> Which should we adopt for 9.1? I'd like to implement the
> "per-standby" approach first since it's simple and seems
> to cover more use cases. Thoughts?

If people agree that the above is our roadmap, implementing "per-standby" first makes sense, and then we can implement the "per-session" GUC later.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
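In configuration terms, Josh's model might look something like the sketch below; both parameter names are hypothetical placeholders:

    # recovery.conf on Slave #1 (Slaves #2 and #3 leave this off)
    synchronous = on

    -- on the master, per session (hypothetical USERSET GUC)
    SET synchronous_replication = on;   -- wait for the synch slaves at commit
    SET synchronous_replication = off;  -- never wait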
On Tue, May 25, 2010 at 1:18 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> I see a lot of value in #4; it makes it possible to distribute read-only
> load to the standby using something like pgbouncer, completely transparently
> to the application.

Agreed.

> In the lesser modes, the application can see slightly
> stale results.

Yes. BTW, even if we had #4, we would need to be careful of the fact that we might see uncommitted results on the standby. That is, a transaction commit might become visible on the standby before the master returns its "success" to a client. I think that we will never get completely transaction-consistent results from the standby until we have implemented the "snapshot cloning" feature.
http://wiki.postgresql.org/wiki/ClusterFeatures#Export_snapshots_to_other_sessions

> But whatever we can easily implement, really. Pick the one that you think is the
> easiest and start with that, but keep the other modes in mind in the design
> and in the user interface so that you don't paint yourself into a corner.

Yep, the design and implementation of #2 and #3 should be easily extensible to #4. I'll keep that in mind.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, May 25, 2010 at 10:29 AM, Josh Berkus <josh@agliodbs.com> wrote:
> I agree that #4 should be done last, but it will be needed, not in the
> least by your employer ;-) . I don't see any obvious way to make #4
> compatible with any significant query load on the slave, but in general
> I'd think that users of #4 are far more concerned with 0% data loss than
> they are with getting the slave to run read queries.

Since #2 and #3 are enough for 0% data loss, I think that such users would be more concerned about what results are visible on the standby. No?

> What we should do is specify it per-standby, and then have a USERSET GUC
> on the master which specifies which transactions will be synched; those
> will be synched only on the slaves which are set up to support synch.
>
> Session #1 will be synched on Slave #1 before commit. Nothing will be
> synched on Slaves #2 and #3, and Session #2 will not wait for synch on
> any slave.
>
> I think this model delivers the maximum HA flexibility to users while
> still making intuitive logical sense.

This makes sense. Since it's relatively easy and simple to implement such a boolean GUC flag rather than "per-transaction" levels (there are four valid values: #1, #2, #3 and #4), I'll do that.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, May 24, 2010 at 10:20 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> At the first design phase, I'd like to clarify which synch levels
> should be supported in 9.1 and how it should be specified by users.

There is another question about the synch level: when should the master wait for replication? In my current design, the backend waits for replication only at the end of the transaction commit. Is this enough? Are there other waiting points?

For example, should a smart or fast shutdown on the master wait for the shutdown checkpoint record to be replicated to the standby (BTW, in 9.0, shutdown waits for the checkpoint record to be *sent*)? Does pg_switch_xlog() need to wait for all of the original WAL file to be replicated?

I'm not sure whether the above two "waits-for-replication" have use cases, so I'm thinking they are not worth implementing, but...

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, 2010-05-25 at 12:40 +0900, Fujii Masao wrote:
> On Tue, May 25, 2010 at 10:29 AM, Josh Berkus <josh@agliodbs.com> wrote:
> > I agree that #4 should be done last, but it will be needed, not in the
> > least by your employer ;-) . I don't see any obvious way to make #4
> > compatible with any significant query load on the slave, but in general
> > I'd think that users of #4 are far more concerned with 0% data loss than
> > they are with getting the slave to run read queries.
>
> Since #2 and #3 are enough for 0% data loss, I think that such users
> would be more concerned about what results are visible on the standby.
> No?

Please add #4 also. You can do that easily at the same time as #2 and #3, and it will leave me free to fix the perceived conflict problems.

--
Simon Riggs
www.2ndQuadrant.com
On Mon, 2010-05-24 at 22:20 +0900, Fujii Masao wrote:
> Second, we need to discuss how to specify the synch
> level. There are three approaches:
>
> * Per standby
> Since the purpose, location and H/W resources often differ
> from one standby to another, specifying the level per standby
> (i.e., we set the level in recovery.conf) is a
> straightforward approach, I think. For example, we can
> choose #3 for a high-availability standby near the master,
> and choose #1 (async) for a remote disaster-recovery
> standby.
>
> * Per transaction
> Define a PGC_USERSET option specifying the level, and
> set it on the master according to the purpose of the
> transaction. In this approach, for example, we can choose
> #4 for a transaction which should be visible on the
> standby as soon as a "success" of the commit has been
> returned to a client. We can also choose #1 for a
> time-critical but not mission-critical transaction.
>
> * Mix
> Allow users to specify the level per standby and per
> transaction at the same time, and then calculate the real
> level from them by using some algorithm.
>
> Which should we adopt for 9.1? I'd like to implement the
> "per-standby" approach first since it's simple and seems
> to cover more use cases. Thoughts?

-1

Synchronous replication implies that a commit should wait. This wait is experienced by the transaction, not by other parts of the system. If we define robustness at the standby level then robustness depends upon unseen administrators, as well as the current up/down state of standbys. This is action-at-a-distance in its worst form.

Imagine having 2 standbys, 1 synch, 1 async. If the synch server goes down, performance will improve and robustness will have been lost. What good would that be?

Imagine a standby connected over a long distance. The DBA accidentally brings up the standby in synch mode, and the primary server hits massive performance problems without any way of directly controlling this.

The worst aspect of standby-level controls is that nobody ever knows how safe a transaction is. There is no definition or test for us to check exactly how safe any particular transaction is. Also, the lack of safety occurs at the time when you least want it - when one of your servers is already down.

So I call "per-standby" settings simple, and broken in multiple ways. Putting the control in the hands of the transaction owner (i.e. on the master) is exactly where the control should be. I personally like the idea of that being a USERSET, though could live with system-wide settings if need be. But the control must be on the *master*, not on the standbys.

The best parameter we can specify is the number of servers that we wish to wait for confirmation from. That is a definition that easily manages the complexity of having various servers up/down at any one time. It also survives misconfiguration more easily, as well as providing a workaround if replicating across a bursty network where we can't guarantee response times, even if the typical response time is good.

(We've discussed this many times before over a period of years, and I'm not really sure why we have to re-discuss this repeatedly just because people disagree. You don't mention the earlier discussions, not sure why. If we want to follow the community process, then all previous discussions need to be taken into account, unless things have changed - which they haven't: same topic, same people, AFAICS.)

--
Simon Riggs
www.2ndQuadrant.com
On Mon, 2010-05-24 at 18:29 -0700, Josh Berkus wrote:
> If people agree that the above is our roadmap, implementing
> "per-standby" first makes sense, and then we can implement the
> "per-session" GUC later.

IMHO "per-standby" sounds simple, but is dangerously simplistic, as explained in another part of the thread. We need to think clearly about failure modes and how they will be handled. Failure modes and edge cases completely govern the design here. "All running smoothly" isn't a major concern, and so it appears that the user interface can be done various ways.

--
Simon Riggs
www.2ndQuadrant.com
On Tue, May 25, 2010 at 12:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Synchronous replication implies that a commit should wait. This wait is
> experienced by the transaction, not by other parts of the system. If we
> define robustness at the standby level then robustness depends upon
> unseen administrators, as well as the current up/down state of standbys.
> This is action-at-a-distance in its worst form.

Maybe, but I can't help thinking people are going to want some form of this. The case where someone wants to do sync rep to the machine in the next rack over and async rep to a server at a remote site seems too important to ignore.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Tue, 2010-05-25 at 12:40 -0400, Robert Haas wrote:
> On Tue, May 25, 2010 at 12:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Synchronous replication implies that a commit should wait. This wait is
> > experienced by the transaction, not by other parts of the system. If we
> > define robustness at the standby level then robustness depends upon
> > unseen administrators, as well as the current up/down state of standbys.
> > This is action-at-a-distance in its worst form.
>
> Maybe, but I can't help thinking people are going to want some form of
> this. The case where someone wants to do sync rep to the machine in
> the next rack over and async rep to a server at a remote site seems
> too important to ignore.

Uhh yeah, that is pretty much the standard use case. The "next rack" is only 50% of the equation. The other part is the disaster recovery rack over 100Mb (or even 10Mb) that is halfway across the country. It is common, very common.

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Hello All:

In the code (costsize.c), I see that effective_cache_size is set to DEFAULT_EFFECTIVE_CACHE_SIZE. This is defined as follows in cost.h:

#define DEFAULT_EFFECTIVE_CACHE_SIZE 16384

But when I say "show shared_buffers" in psql I get:

 shared_buffers
----------------
 28MB

In the postgresql.conf file, the following lines appear:

shared_buffers = 28MB    # min 128kB
                         # (change requires restart)
#temp_buffers = 8MB      # min 800kB

So I am assuming that the buffer pool size is 28MB = 28 * 128 = 3584 8K pages. So should effective_cache_size be set to 3584 rather than the 16384?

Thanks,

MMK.
Robert Haas <robertmhaas@gmail.com> wrote:
> Simon Riggs <simon@2ndquadrant.com> wrote:
>> If we define robustness at the standby level then robustness
>> depends upon unseen administrators, as well as the current
>> up/down state of standbys. This is action-at-a-distance in its
>> worst form.
>
> Maybe, but I can't help thinking people are going to want some
> form of this. The case where someone wants to do sync rep to the
> machine in the next rack over and async rep to a server at a
> remote site seems too important to ignore.

I think there may be a terminology issue here -- I took "configure by standby" to mean that *at the master* you would specify rules for each standby. I think Simon took it to mean that each standby would define the rules for replication to it. Maybe this issue can resolve gracefully with a bit of clarification?

-Kevin
On Tue, 2010-05-25 at 12:40 -0400, Robert Haas wrote:
> On Tue, May 25, 2010 at 12:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Synchronous replication implies that a commit should wait. This wait is
> > experienced by the transaction, not by other parts of the system. If we
> > define robustness at the standby level then robustness depends upon
> > unseen administrators, as well as the current up/down state of standbys.
> > This is action-at-a-distance in its worst form.
>
> Maybe, but I can't help thinking people are going to want some form of
> this. The case where someone wants to do sync rep to the machine in
> the next rack over and async rep to a server at a remote site seems
> too important to ignore.

The use case of "machine in the next rack over and async rep to a server at a remote site" *is* important, but you give no explanation as to why it implies "per-standby" is the solution to it. If you read the rest of my email, you'll see that I have explained the problems "per-standby" settings would cause. Please don't be so quick to claim it is me ignoring anything.

--
Simon Riggs
www.2ndQuadrant.com
On Tue, 2010-05-25 at 11:52 -0500, Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> > Simon Riggs <simon@2ndquadrant.com> wrote:
> >> If we define robustness at the standby level then robustness
> >> depends upon unseen administrators, as well as the current
> >> up/down state of standbys. This is action-at-a-distance in its
> >> worst form.
> >
> > Maybe, but I can't help thinking people are going to want some
> > form of this. The case where someone wants to do sync rep to the
> > machine in the next rack over and async rep to a server at a
> > remote site seems too important to ignore.
>
> I think there may be a terminology issue here -- I took "configure
> by standby" to mean that *at the master* you would specify rules for
> each standby. I think Simon took it to mean that each standby would
> define the rules for replication to it. Maybe this issue can
> resolve gracefully with a bit of clarification?

The use case of "machine in the next rack over and async rep to a server at a remote site" would require the settings

server.nextrack = synch
server.remotesite = async

which leaves open the question of what happens when "nextrack" is down.

In many cases, to give adequate performance in that situation people add an additional server, so the config becomes

server.nextrack1 = synch
server.nextrack2 = synch
server.remotesite = async

We then want to specify, for performance reasons, that we can get a reply from either nextrack1 or nextrack2, so it all still works safely and quickly if one of them is down. How can we express that rule concisely? With some difficulty.

My suggestion is simply to have a single parameter (name unimportant)

number_of_synch_servers_we_wait_for = N

which is much easier to understand because it is phrased in terms of the guarantee given to the transaction, not in terms of what the admin thinks is the situation.

--
Simon Riggs
www.2ndQuadrant.com
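For the three-server example above, the quorum-style setting would collapse the per-server rules into a single line; the parameter name is Simon's placeholder, not committed syntax:

    # postgresql.conf on the master
    number_of_synch_servers_we_wait_for = 1   # commit returns once any one
                                              # standby has acknowledged

With nextrack1, nextrack2 and remotesite all attached, a commit normally waits for whichever nearby standby replies first, and the setting keeps working unchanged if either of them is down.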
On Tue, May 25, 2010 at 1:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The use case of "machine in the next rack over and async rep to a server
> at a remote site" would require the settings
>
> server.nextrack = synch
> server.remotesite = async
>
> which leaves open the question of what happens when "nextrack" is down.
>
> In many cases, to give adequate performance in that situation people add
> an additional server, so the config becomes
>
> server.nextrack1 = synch
> server.nextrack2 = synch
> server.remotesite = async
>
> We then want to specify, for performance reasons, that we can get a reply
> from either nextrack1 or nextrack2, so it all still works safely and
> quickly if one of them is down. How can we express that rule concisely?
> With some difficulty.

Perhaps the difficulty here is that those still look like per-server settings to me. Just maybe with a different set of semantics.

> My suggestion is simply to have a single parameter (name unimportant)
>
> number_of_synch_servers_we_wait_for = N
>
> which is much easier to understand because it is phrased in terms of the
> guarantee given to the transaction, not in terms of what the admin
> thinks is the situation.

So I agree that we need to talk about whether or not we want to do this. I'll give my opinion. I am not sure how useful this really is. Consider a master with two standbys. The master commits a transaction and waits for one of the two standbys, then acknowledges the commit back to the user. Then the master crashes. Now what? It's not immediately obvious which standby we should bring online as the primary, and if we guess wrong we could lose transactions thought to be committed. This is probably a solvable problem, with enough work: we can write a script to check the last LSN received by each of the two standbys and promote whichever one is further along.

But... what happens if the master and one standby BOTH crash simultaneously? There's no way of knowing (until we get at least one of them back up) whether it's safe to promote the other standby.

I like the idea of a "quorum commit" type feature where we promise the user that things are committed when "enough" servers have acknowledged the commit. But I think most people are not going to want that configuration unless we also provide some really good management tools that we don't have today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
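The "check the last LSN received" script Robert mentions can be sketched with pg_last_xlog_receive_location(), which does exist in 9.0; the host names and WAL locations below are made up for illustration:

    $ psql -h standby1 -At -c "SELECT pg_last_xlog_receive_location();"
    0/3000138
    $ psql -h standby2 -At -c "SELECT pg_last_xlog_receive_location();"
    0/3000070
    # standby1 reports the higher WAL location, so promote standby1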
On 25/05/10 19:49, MMK wrote:
> Hello All:
> In the code (costsize.c), I see that effective_cache_size is set to
> DEFAULT_EFFECTIVE_CACHE_SIZE. This is defined as follows in cost.h:
> #define DEFAULT_EFFECTIVE_CACHE_SIZE 16384
> But when I say "show shared_buffers" in psql I get:
>  shared_buffers
> ----------------
>  28MB
> In the postgresql.conf file, the following lines appear:
> shared_buffers = 28MB # min 128kB (change requires restart)
> #temp_buffers = 8MB # min 800kB
>
> So I am assuming that the buffer pool size is 28MB = 28 * 128 = 3584 8K pages.
> So should effective_cache_size be set to 3584 rather than the 16384?

No. Please see the manual for what effective_cache_size means:

http://www.postgresql.org/docs/8.4/interactive/runtime-config-query.html#GUC-EFFECTIVE-CACHE-SIZE

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
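In short, effective_cache_size is a planner estimate, not an allocation, and it is set independently of shared_buffers; the default of 16384 pages x 8KB works out to the 128MB named in the documentation quoted later in the thread. For example:

    # postgresql.conf
    shared_buffers = 28MB            # actual buffer pool allocation
    effective_cache_size = 1GB      # planner's estimate of OS cache plus
                                    # shared buffers; illustrative value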
On Tue, 2010-05-25 at 13:31 -0400, Robert Haas wrote:
> On Tue, May 25, 2010 at 1:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > The use case of "machine in the next rack over and async rep to a server
> > at a remote site" would require the settings
> >
> > server.nextrack = synch
> > server.remotesite = async
> >
> > which leaves open the question of what happens when "nextrack" is down.
> >
> > In many cases, to give adequate performance in that situation people add
> > an additional server, so the config becomes
> >
> > server.nextrack1 = synch
> > server.nextrack2 = synch
> > server.remotesite = async
> >
> > We then want to specify, for performance reasons, that we can get a reply
> > from either nextrack1 or nextrack2, so it all still works safely and
> > quickly if one of them is down. How can we express that rule concisely?
> > With some difficulty.
>
> Perhaps the difficulty here is that those still look like per-server
> settings to me. Just maybe with a different set of semantics.

(Those are the per-server settings.)

--
Simon Riggs
www.2ndQuadrant.com
On Tue, 2010-05-25 at 19:08 +0200, Alastair Turner wrote:
> On Tue, May 25, 2010 at 6:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > The best parameter we can specify is the number of servers that we wish
> > to wait for confirmation from. That is a definition that easily manages
> > the complexity of having various servers up/down at any one time. It
> > also survives misconfiguration more easily, as well as providing a
> > workaround if replicating across a bursty network where we can't
> > guarantee response times, even if the typical response time is good.
>
> This may be an incredibly naive question, but what happens to the
> transaction on the master if the number of confirmations is not
> received? Is this intended to create a situation where the master
> effectively becomes unavailable for write operations when its
> synchronous slaves are unavailable?

How we handle degraded mode is important, yes. Whatever parameters we choose, the problem will remain the same. Should we just ignore degraded mode and respond as if nothing bad had happened? Most people would say not.

If we specify server1 = synch and server2 = async, we then also need to specify what happens if server1 is down. People might often specify: if (server1 == down) server2 = synch. So now we have 3 configuration settings, one quite complex. It's much easier to say you want to wait for N servers to respond, but don't care which they are. One parameter, simple and flexible.

In both cases, we have to figure out what to do if we can't get either server to respond. In replication there is no such thing as "server down", just "server didn't reply in time X". So we need to define timeouts. So whatever we do, we need additional parameters to specify timeouts (including wait-forever as an option) and action-on-timeout: commit or rollback.

--
Simon Riggs
www.2ndQuadrant.com
On Tue, 2010-05-25 at 13:31 -0400, Robert Haas wrote:
> So I agree that we need to talk about whether or not we want to do
> this. I'll give my opinion. I am not sure how useful this really is.
> Consider a master with two standbys. The master commits a
> transaction and waits for one of the two standbys, then acknowledges
> the commit back to the user. Then the master crashes. Now what?
> It's not immediately obvious which standby we should bring online as
> the primary, and if we guess wrong we could lose transactions thought
> to be committed. This is probably a solvable problem, with enough
> work: we can write a script to check the last LSN received by each of
> the two standbys and promote whichever one is further along.
>
> But... what happens if the master and one standby BOTH crash
> simultaneously? There's no way of knowing (until we get at least one
> of them back up) whether it's safe to promote the other standby.

Not much of a problem really, is it? If you have one server left out of 3, then you promote it OR you stay down - your choice. There is no "safe to promote" knowledge in *any* scenario; you never know what was on the primary, only what was received by the standby. If you have N standbys still up, you can pick which to promote using the algorithm you mention. Remember that the WAL is sequential, so it's not like the commit order of transactions will differ across servers if we use quorum commit. So not a problem.

The multiple-simultaneous-failure case is fairly common for people that pick the "synch to server in next rack" option, because there are 100 reasons why we'd take out both at the same time; ask JD.

> I like the idea of a "quorum commit" type feature where we promise the
> user that things are committed when "enough" servers have acknowledged
> the commit. But I think most people are not going to want that
> configuration unless we also provide some really good management tools
> that we don't have today.

Good name. Management tools have nothing to do with this; that's completely orthogonal.

--
Simon Riggs
www.2ndQuadrant.com
On Tue, May 25, 2010 at 6:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> .......
> The best parameter we can specify is the number of servers that we wish
> to wait for confirmation from. That is a definition that easily manages
> the complexity of having various servers up/down at any one time. It
> also survives misconfiguration more easily, as well as providing a
> workaround if replicating across a bursty network where we can't
> guarantee response times, even if the typical response time is good.

This may be an incredibly naive question, but what happens to the transaction on the master if the number of confirmations is not received? Is this intended to create a situation where the master effectively becomes unavailable for write operations when its synchronous slaves are unavailable?

Alastair "Bell" Turner

^F5
Simon Riggs wrote:
> How we handle degraded mode is important, yes. Whatever parameters we
> choose, the problem will remain the same.
>
> Should we just ignore degraded mode and respond as if nothing bad had
> happened? Most people would say not.
>
> If we specify server1 = synch and server2 = async, we then also need to
> specify what happens if server1 is down. People might often specify:
> if (server1 == down) server2 = synch.

I have a hard time imagining including async servers in the quorum. If an async server's vote is necessary to reach quorum due to a 'real' sync standby server failure, it would mean that the async-intended standby is now also in sync with the master's transactions. IMHO this is a bad situation, since instead of the DBA getting the error "not enough sync standbys to reach quorum", he'll now get "database is slow" complaints, only to find out later that too many sync standby servers went south. (Under the assumption that async servers are mostly on links too slow to consider for sync standby.)

regards,
Yeb Havinga
Hi,

Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, 2010-05-25 at 19:08 +0200, Alastair Turner wrote:
>> On Tue, May 25, 2010 at 6:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > The best parameter we can specify is the number of servers that we wish
>> > to wait for confirmation from.
>>
>> This may be an incredibly naive question, but what happens to the
>> transaction on the master if the number of confirmations is not
>> received?
>
> It's much easier to say you want to wait for N servers to respond, but
> don't care which they are. One parameter, simple and flexible.
[...]
> So whatever we do, we need additional parameters to specify timeouts
> (including wait-forever as an option) and action-on-timeout: commit or
> rollback.

I was preparing an email along the line that we need each slave to declare its desired minimum level of synchronicity, and have the master filter that with what the transaction wants. Scratch that.

Thinking about it some more, I see that Simon's proposal is both simpler and more effective: we already have Hot Standby and admin functions that tell us the last replayed LSN. The bigger one wins. So in case of failover we know which slave to choose.

The only use case I can see for what I had in mind is to allow the user to choose which server is trusted to have accurate data or better read-only performance. But if the link is slow, the code will notice soon enough, mind you.

I'm still not sure about my preference here, but I can see why Simon's proposal is simpler and addresses all concerns, apart from forcing the servers into a non-optimal setup for a gain that is hard to see.

Regards,

--
dim
<table border="0" cellpadding="0" cellspacing="0"><tr><td style="font: inherit;" valign="top">Hello Heikki:<br /><br />Thisis what the documentation says (see below).<br /><br />But it does not tell my anything about what the actual buffersize is.<br />How do I know what the real buffer size is? I am using 8.4.4 and I am running only one query at a time.<br/><br />Cheers,<br /><br />MMK.<br /><br />Sets the planner's assumption about the effective size of the disk cachethat is available to a single query. This is factored into estimates of the cost of using an index; a higher value makesit more likely index scans will be used, a lower value makes it more likely sequential scans will be used. When settingthis parameter you should consider both <span class="PRODUCTNAME">PostgreSQL</span>'s shared buffers and the portionof the kernel's disk cache that will be used for <span class="PRODUCTNAME">PostgreSQL</span> data files. Also, takeinto account the expected number of concurrent queries on different tables, since they will have to share the availablespace. This parameter has no effect on the size of shared memory allocated by <span class="PRODUCTNAME">PostgreSQL</span>,nor does it reserve kernel disk cache; it is used only for estimation purposes. Thedefault is 128 megabytes (<tt class="LITERAL">128MB</tt>). <br /><br /><br /><br />--- On <b>Tue, 5/25/10, Heikki Linnakangas<i><heikki.linnakangas@enterprisedb.com></i></b> wrote:<br /><blockquote style="border-left: 2px solid rgb(16,16, 255); margin-left: 5px; padding-left: 5px;"><br />From: Heikki Linnakangas <heikki.linnakangas@enterprisedb.com><br/>Subject: Re: [HACKERS] Confused about the buffer pool size<br />To: "MMK"<bomuvi@yahoo.com><br />Cc: "PostgreSQL-development" <pgsql-hackers@postgresql.org><br />Date: Tuesday,May 25, 2010, 11:36 AM<br /><br /><div class="plainMail">On 25/05/10 19:49, MMK wrote:<br />> Hello All:<br />>In the code (costsize.c), I see that effective_cache_size is set to DEFAULT_EFFECTIVE_CACHE_SIZE.<br />> This isdefined as follows in cost.h<br />> #define DEFAULT_EFFECTIVE_CACHE_SIZE 16384<br />> But when I say<br />> showshared_buffers in psql I get,<br />> shared_buffers ---------------- 28MB<br />> In postgresql.conf file, the followinglines appear<br />> shared_buffers = 28MB # min 128kB # (change requires restart)#temp_buffers= 8MB # min 800kB<br />> <br />> So I am assuming that the buffer pool sizeis 28MB = 28 * 128 = 3584 8K pages.<br />> So should effective_cache_size be set to 3584 rather than the 16384?<br/><br />No. Please see the manual for what effective_cache_size means:<br /><br /><a href="http://www.postgresql.org/docs/8.4/interactive/runtime-config-query.html#GUC-EFFECTIVE-CACHE-SIZE" target="_blank">http://www.postgresql.org/docs/8.4/interactive/runtime-config-query.html#GUC-EFFECTIVE-CACHE-SIZE</a><br /><br/>-- Heikki Linnakangas<br /> EnterpriseDB <a href="http://www.enterprisedb.com" target="_blank">http://www.enterprisedb.com</a><br/><br />-- Sent via pgsql-hackers mailing list (<a href="/mc/compose?to=pgsql-hackers@postgresql.org" ymailto="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/>To make changes to your subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers" target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/></div></blockquote></td></tr></table><br />
On Tue, 2010-05-25 at 21:19 +0200, Yeb Havinga wrote:
> Simon Riggs wrote:
> > How we handle degraded mode is important, yes. Whatever parameters we
> > choose, the problem will remain the same.
> >
> > Should we just ignore degraded mode and respond as if nothing bad had
> > happened? Most people would say not.
> >
> > If we specify server1 = synch and server2 = async, we then also need to
> > specify what happens if server1 is down. People might often specify:
> > if (server1 == down) server2 = synch.
>
> I have a hard time imagining including async servers in the quorum. If
> an async server's vote is necessary to reach quorum due to a 'real' sync
> standby server failure, it would mean that the async-intended standby is
> now also in sync with the master's transactions. IMHO this is a bad
> situation, since instead of the DBA getting the error "not enough sync
> standbys to reach quorum", he'll now get "database is slow" complaints,
> only to find out later that too many sync standby servers went south.
> (Under the assumption that async servers are mostly on links too slow to
> consider for sync standby.)

Yeh, there's difficulty either way. We don't need to think of servers as being "synch" or "async"; more likely we would rate them in terms of typical synchronisation delay. So yeh, calling them "fast" and "slow" in terms of synchronisation delay makes sense.

Some people with a low xact rate and a high need for protection might want to switch across to the slow server and keep running. If not, the max_synch_delay would trip and you would then select synch_failure_action = rollback.

The realistic response is to add a second "fast" sync server, to allow you to stay up even when you lose one of the fast servers. That now gives you 4 servers, and the failure modes start to get really complex. Specifying rules to achieve what you're after would be much harder. Some people might want that, but most people won't in the general case, and if they did specify such rules they'd likely get them wrong.

All of these issues show why I want to specify the synchronisation mode as a USERSET. That will allow us to specify more easily which parts of our application are important when the cluster is degraded, and which data is so critical it must reach multiple servers.

--
Simon Riggs
www.2ndQuadrant.com
MMK,

> But it does not tell me anything about what the actual buffer size is.
> How do I know what the real buffer size is? I am using 8.4.4 and I am
> running only one query at a time.

Please move this discussion to the pgsql-general or pgsql-performance lists. pgsql-hackers is for working on PostgreSQL code, and further questions on this list will probably not be answered.

Other than that, I have no idea what you mean by "buffer size", nor why you need to know it. I'd suggest starting your post on the other mailing list by explaining what specific problem you're trying to solve.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On May 25, 2010, at 22:16 , Simon Riggs wrote:
> All of these issues show why I want to specify the synchronisation mode
> as a USERSET. That will allow us to specify more easily which parts of
> our application are important when the cluster is degraded, and which
> data is so critical it must reach multiple servers.

Hm, but since flushing an important COMMIT to the slave(s) will also need to flush all previous (potentially unimportant) COMMITs to the slave(s), isn't there a substantial chance of priority-inversion type problems there?

Then again, if asynchronous_commit proved to be effective, then so will this, probably, so maybe my fear is unjustified.

best regards,
Florian Pflug
On Wed, May 26, 2010 at 2:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> My suggestion is simply to have a single parameter (name unimportant)
>
> number_of_synch_servers_we_wait_for = N
>
> which is much easier to understand because it is phrased in terms of the
> guarantee given to the transaction, not in terms of what the admin
> thinks is the situation.

How can we choose #2, #3 or #4 by using your proposed option?

As the result of the discussion, I'm inclined towards choosing the "mix" approach. How about specifying #1, #2, #3 or #4 per standby, and specifying at the master, as Simon suggests, the number of "synchronous" (i.e., #2, #3 or #4) standbys the transaction commit waits for?

We add a new option "replication_mode" (better name?) into recovery.conf, specifying when the standby sends the ACK meaning the completion of replication to the master. Valid values are "async", "recv", "fsync" and "redo". These correspond to #1, #2, #3 and #4 as defined at the top of the thread.

If "async", the standby never sends any ACK. If "recv", "fsync" or "redo", the standby sends the ACK when it has received, fsynced or replayed the WAL from the master, respectively.

On the other hand, we add a new GUC "max_synchronous_standbys" (I prefer it to "number_of_synch_servers_we_wait_for", but does anyone have a better name?) as PGC_USERSET into postgresql.conf. It specifies the maximum number of standbys which the transaction commit must wait for the ACK from.

If max_synchronous_standbys is 0, no transaction commit waits for an ACK even if some connected standbys set their replication_mode to "recv", "fsync" or "redo". If it's positive, the transaction commit waits for N ACKs, where N is the smaller of max_synchronous_standbys and the actual number of connected "synchronous" standbys.

Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
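Put together, Fujii's "mix" proposal would look roughly like the sketch below; the names are his proposals from the message above, not settled syntax:

    # recovery.conf on each standby
    replication_mode = 'fsync'        # async | recv | fsync | redo

    # postgresql.conf on the master (PGC_USERSET, so settable per session)
    max_synchronous_standbys = 1      # 0 = never wait; otherwise wait for
                                      # min(N, connected synchronous standbys)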
On Tue, May 25, 2010 at 11:36 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> As the result of the discussion, I'm inclined towards choosing the
> "mix" approach. How about specifying #1, #2, #3 or #4 per standby,
> and specifying at the master, as Simon suggests, the number of
> "synchronous" (i.e., #2, #3 or #4) standbys the transaction commit
> waits for?
>
> We add a new option "replication_mode" (better name?) into
> recovery.conf, specifying when the standby sends the ACK meaning the
> completion of replication to the master. Valid values are "async",
> "recv", "fsync" and "redo". These correspond to #1, #2, #3 and #4
> as defined at the top of the thread.
>
> If "async", the standby never sends any ACK. If "recv", "fsync"
> or "redo", the standby sends the ACK when it has received, fsynced
> or replayed the WAL from the master, respectively.
>
> On the other hand, we add a new GUC "max_synchronous_standbys"
> (I prefer it to "number_of_synch_servers_we_wait_for", but does
> anyone have a better name?) as PGC_USERSET into postgresql.conf.
> It specifies the maximum number of standbys which the transaction
> commit must wait for the ACK from.
>
> If max_synchronous_standbys is 0, no transaction commit waits for
> an ACK even if some connected standbys set their replication_mode to
> "recv", "fsync" or "redo". If it's positive, the transaction commit
> waits for N ACKs, where N is the smaller of max_synchronous_standbys
> and the actual number of connected "synchronous" standbys.
>
> Thoughts?

I think we're over-engineering this. For a first version we should do something simple. Then we can add some of these extra knobs in a follow-on patch. Quorum commit is definitely an extra knob, IMHO.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, May 26, 2010 at 1:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2010-05-25 at 12:40 +0900, Fujii Masao wrote:
> > On Tue, May 25, 2010 at 10:29 AM, Josh Berkus <josh@agliodbs.com> wrote:
> > > I agree that #4 should be done last, but it will be needed, not in the
> > > least by your employer ;-) . I don't see any obvious way to make #4
> > > compatible with any significant query load on the slave, but in general
> > > I'd think that users of #4 are far more concerned with 0% data loss than
> > > they are with getting the slave to run read queries.
> >
> > Since #2 and #3 are enough for 0% data loss, I think that such users
> > would be more concerned about what results are visible on the standby.
> > No?
>
> Please add #4 also. You can do that easily at the same time as #2 and
> #3, and it will leave me free to fix the perceived conflict problems.

I think that we should implement the feature in small steps rather than submitting one big patch at a time. So I'd like to focus on #2 and #3 at first, and on #4 later (maybe in the third or fourth CF).

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, 2010-05-25 at 23:59 -0400, Robert Haas wrote:
> Quorum commit is definitely an extra knob, IMHO.

No, it's about three knobs fewer, as I have explained. Explain your position, don't just demand others listen.

--
Simon Riggs
www.2ndQuadrant.com
On Wed, 2010-05-26 at 13:03 +0900, Fujii Masao wrote:
> On Wed, May 26, 2010 at 1:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Tue, 2010-05-25 at 12:40 +0900, Fujii Masao wrote:
> > > Since #2 and #3 are enough for 0% data loss, I think that such users
> > > would be more concerned about what results are visible on the standby.
> > > No?
> >
> > Please add #4 also. You can do that easily at the same time as #2 and
> > #3, and it will leave me free to fix the perceived conflict problems.
>
> I think that we should implement the feature in small steps rather than
> submitting one big patch at a time. So I'd like to focus on #2 and #3
> at first, and on #4 later (maybe in the third or fourth CF).

We both know that if you do #2 and #3 then doing #4 also is trivial. If you leave it out then we'll end up missing something that is required and have to rework everything.

--
Simon Riggs
www.2ndQuadrant.com
On Wed, 2010-05-26 at 12:36 +0900, Fujii Masao wrote:
> On Wed, May 26, 2010 at 2:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > My suggestion is simply to have a single parameter (name unimportant)
> >
> > number_of_synch_servers_we_wait_for = N
> >
> > which is much easier to understand because it is phrased in terms of the
> > guarantee given to the transaction, not in terms of what the admin
> > thinks is the situation.
>
> How can we choose #2, #3 or #4 by using your proposed option?

> If "async", the standby never sends any ACK. If "recv", "fsync"
> or "redo", the standby sends the ACK when it has received, fsynced
> or replayed the WAL from the master, respectively.

Everything I've said about "per-standby" settings applies here, and it was based upon having just 2 settings: sync and async. If you have four settings instead, things get even more complex. If we were going to reduce complexity, it would be to reduce the number of options here to just offering option #2 in the first phase.

AFAICS people would only ever select #2 or #4 anyway. IMHO #3 isn't likely to be selected on its own because it performs badly for no real benefit. Having two standbys, I might want to specify #2 for both, or if one is down then #3 for the remaining standby instead.

Nobody else has yet tried to explain how we would specify what happens when one of the standbys is down, with per-standby settings. Failure modes are where the complexity is here. However we proceed, we must have a discussion about how we specify the failure modes. This is not something we should add on at the last minute; we should think about it now and address it openly.

Oracle Data Guard is a great resource for what semantics we might need to cover, but it's also a lesson in the complexity that comes from per-standby settings. Please look at the net_timeout and alternate options in particular. See how difficult it is to specify failure modes, even though Data Guard offers probably dozens of parameters and options - its orientation is per-standby, not towards the transaction and the user.

> On the other hand, we add a new GUC "max_synchronous_standbys"
> (I prefer it to "number_of_synch_servers_we_wait_for", but does
> anyone have a better name?) as PGC_USERSET into postgresql.conf.
> It specifies the maximum number of standbys which the transaction
> commit must wait for the ACK from.
>
> If max_synchronous_standbys is 0, no transaction commit waits for
> an ACK even if some connected standbys set their replication_mode to
> "recv", "fsync" or "redo". If it's positive, the transaction commit
> waits for N ACKs, where N is the smaller of max_synchronous_standbys
> and the actual number of connected "synchronous" standbys.

To summarise, I think we can get away with just 3 parameters:

synchronous_replication = N # similar in name to synchronous_commit
synch_rep_timeout = T
synch_rep_timeout_action = commit | abort

Conceptually, this is "I want at least N replica copies made of my database changes; I will wait for up to T milliseconds to get that, otherwise I will do X". Very easy and clear for an application to understand what guarantees it is requesting. Also very easy for the administrator to understand the guarantees requested and how to provision for them: to deliver robustness they typically need N+1 servers, or for even higher levels of robustness and performance N+2, etc.

Making synchronous_replication a USERSET would be an industry first: transaction-controlled robustness at every level.

--
Simon Riggs
www.2ndQuadrant.com
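As a concrete usage sketch of Simon's three parameters (the names are his proposals; the values are illustrative):

    # postgresql.conf on the master
    synchronous_replication = 1          # wait for 1 replica copy per commit
    synch_rep_timeout = 2000             # give up waiting after 2000 ms
    synch_rep_timeout_action = commit    # on timeout: commit anyway (or abort)

    -- per transaction, if synchronous_replication is USERSET
    SET synchronous_replication = 2;     -- this transaction wants 2 copies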
On Wed, May 26, 2010 at 5:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Everything I've said about "per-standby" settings applies here, and it
> was based upon having just 2 settings: sync and async. If you have four
> settings instead, things get even more complex. If we were going to
> reduce complexity, it would be to reduce the number of options here to
> just offering option #2 in the first phase.
>
> AFAICS people would only ever select #2 or #4 anyway. IMHO #3 isn't
> likely to be selected on its own because it performs badly for no real
> benefit. Having two standbys, I might want to specify #2 for both, or if
> one is down then #3 for the remaining standby instead.

I guess that dropping support for #3 doesn't reduce complexity, since the code for #3 is almost the same as that for #2. Just as walreceiver sends the ACK after receiving the WAL in the #2 case, it only has to do the same thing after the WAL flush.

> Nobody else has yet tried to explain how we would specify what happens
> when one of the standbys is down, with per-standby settings. Failure
> modes are where the complexity is here. However we proceed, we must have
> a discussion about how we specify the failure modes.

>> Imagine having 2 standbys, 1 synch, 1 async. If the synch server goes
>> down, performance will improve and robustness will have been lost. What
>> good would that be?

You are concerned about the above case you described in another post? In that case, if you want to ensure robustness, you can specify #2, #3 or #4 on both standbys. If one of the standbys is at a remote site, we can additionally set max_synchronous_standbys to 1. If you don't want to fail over to the standby at the remote site when the master goes down, you can specify #1 for the remote standby, so the standby in the nearby location is always guaranteed to be in sync with the master.

> Oracle Data Guard is a great resource for what semantics we might need
> to cover, but it's also a lesson in the complexity that comes from
> per-standby settings. Please look at the net_timeout and alternate
> options in particular.

Yeah, I'll research Oracle Data Guard.

> To summarise, I think we can get away with just 3 parameters:
> synchronous_replication = N # similar in name to synchronous_commit
> synch_rep_timeout = T
> synch_rep_timeout_action = commit | abort

I agree to adding the latter two parameters, which are also listed on my outline of SynchRep.
http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability

> Conceptually, this is "I want at least N replica copies made of my
> database changes; I will wait for up to T milliseconds to get that,
> otherwise I will do X". Very easy and clear for an application to
> understand what guarantees it is requesting. Also very easy for the
> administrator to understand the guarantees requested and how to
> provision for them: to deliver robustness they typically need N+1
> servers, or for even higher levels of robustness and performance N+2,
> etc.

I don't feel that the "synchronous_replication" approach is intuitive for the administrator. Even on this thread, some people seem to prefer the "per-standby" setting.
Without "per-standby" setting, when there are two standbys, one is in the near rack and another is in remote site, "synchronous_replication=1" cannot guarantee that the near standby is always synch with the master. So when the master goes down, unfortunately we might have to failover to the remote standby. OTOH, "synchronous_replication=2" degrades the performance on the master very much. "synchronous_replication" approach doesn't seem to cover the typical use case. Also, when "synchronous_replication=1" and one of synchronous standbys goes down, how should the surviving standby catch up with the master? Such standby might be too far behind the master. The transaction commit should wait for the ACK from the lagging standby immediately even if there might be large gap? If yes, "synch_rep_timeout" would screw up the replication easily. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, May 26, 2010 at 2:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, 2010-05-25 at 23:59 -0400, Robert Haas wrote: >> Quorum commit is definitely an extra knob, IMHO. > > No, it's about three less, as I have explained. > > Explain your position, don't just demand others listen. OK. In words of one syllable, your way still has all the same knobs, plus some more. You sketched out a design which still had a per-standby setting for each standby, but IN ADDITION had a setting to control quorum commit[1]. You also argued that we needed four options for each transaction rather than three[2], and that we need a userset GUC to control the behavior on a per-transaction basis[3]. Not one other person has agreed that we need all of these options in the first version of the patch. We don't. We can start with a sync rep patch that does ONE thing and does it well, and we can add these other things later. I don't think I'm going too far out on a limb when I say that it is easier to get a smaller patch committed than it is to get a bigger one committed, and it is less likely to have bugs. [1] http://archives.postgresql.org/pgsql-hackers/2010-05/msg01347.php [2] http://archives.postgresql.org/pgsql-hackers/2010-05/msg01333.php [3] http://archives.postgresql.org/pgsql-hackers/2010-05/msg01334.php -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
A suggestion, based on what I believe would be ideal default settings for a fully developed SR capability. The thought being that as long as the default behaviour was stable, additional knobs could be added across version boundaries without causing trouble.

Per slave the master needs to know:
- The identity of the slave, even if only to limit who can replicate (this will have to be specified)
- Whether to expect an acknowledgement from the slave (as will this)
- How long to wait for the acknowledgement (this may be a default)
- What the slave is expected to do before acknowledging (I think this should default to remote flush to disk - #3 in the mail which started this thread - since it prevents data loss without exposing the master to the possibility of locking delays)

Additionally the process on the master requires:
- How many acknowledgments to require before declaring success (defaulted to the number of servers expected to acknowledge, since it will cause the fewest surprises when failing over to a replica)
- What to do if the number of acknowledgments is not received (defaulting to abort/rollback, since this is really what differentiates synchronous from asynchronous replication - the certainty that once data has been committed it can be recovered)

So in order to set up synchronous replication all a DBA would have to specify is the slave server, that it is expected to send acknowledgments, and possibly a timeout. If this is in fact a desirable state for the default behaviour or minimum settings requirement, then I would say it is also a desirable target for the first patch. Alastair "Bell" Turner ^F5
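Read as a checklist, the minimum settings above might come out as a master-side sketch like the following; every name here is invented for illustration, since the mail proposes semantics rather than syntax:

  # per-slave registration on the master (hypothetical syntax)
  standby 'reporting' host=10.0.0.10 ack=off                       # never waited on
  standby 'failover'  host=10.0.0.11 ack=on ack_timeout=30s ack_level=flush
  required_acks = 1               # default: the number of acknowledging standbys (here 1)
  missing_ack_action = rollback   # the proposed default; disputed later in the thread

Under the proposed defaults, only the first two lines would actually need to be written.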
On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote: > I don't think that dropping support for #3 reduces complexity, > since the code for #3 is almost the same as that for #2. Just as > walreceiver sends the ACK after receiving the WAL in the #2 case, it > only has to do the same thing after the WAL flush. Hmm, well the code for #3 is also similar to the code for #4. So if you do #2, it's easy to do #2, #3 and #4 together. The comment is about whether having #3 makes sense from a user interface perspective. It's easy to add options, but they must have useful meaning. -- Simon Riggs www.2ndQuadrant.com
On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote: > > To summarise, I think we can get away with just 3 parameters: > > synchronous_replication = N # similar in name to synchronous_commit > > synch_rep_timeout = T > > synch_rep_timeout_action = commit | abort > > I agree with adding the latter two parameters, which are also listed in > my outline of SynchRep. > http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability > > > Conceptually, this is "I want at least N replica copies made of my > > database changes, I will wait for up to T milliseconds to get that > > otherwise I will do X". Very easy and clear for an application to > > understand what guarantees it is requesting. Also very easy for the > > administrator to understand the guarantees requested and how to > > provision for them: to deliver robustness they typically need N+1 > > servers, or for even higher levels of robustness and performance N+2 > > etc.. > > I don't feel that the "synchronous_replication" approach is intuitive for > the administrator. Even on this thread, some people seem to prefer the > "per-standby" setting. Maybe they do, but that is because nobody has yet explained how you would handle failure modes with per-standby settings. When you do they will likely change their minds. Put the whole story on the table before trying to force a decision. > Without a "per-standby" setting, when there are two standbys, one in > the near rack and another at a remote site, "synchronous_replication=1" > cannot guarantee that the near standby is always in sync with the master. > So when the master goes down, unfortunately we might have to fail over to > the remote standby. If the remote server responded first, then that proves it is a better candidate for failover than the one you think of as near. If the two standbys vary over time then you have network problems that will directly affect the performance on the master; synch_rep = N would respond better to any such problems. > OTOH, "synchronous_replication=2" degrades the > performance on the master very much. Yes, but only because you have only one near standby. It would clearly be foolish to make this setting without 2+ near standbys. We would then have 4 or more servers; how do we specify everything for that config?? > The "synchronous_replication" approach > doesn't seem to cover the typical use case. You described the failure modes for the quorum proposal, but avoided describing the failure modes for the "per-standby" proposal. Please explain what will happen when the near server is unavailable, with per-standby settings. Please also explain what will happen if we choose to have 4 or 5 servers to maintain performance in case of the near server going down. How will we specify the failure modes? > Also, when "synchronous_replication=1" and one of the synchronous standbys > goes down, how should the surviving standby catch up with the master? > Such a standby might be too far behind the master. Should the transaction > commit immediately start waiting for the ACK from the lagging standby, > even if there is a large gap? If yes, "synch_rep_timeout" would screw up > the replication easily. That depends upon whether we send the ACK at point #2, #3 or #4. It would only cause a problem if you waited until #4. I've explained why I have made the proposals I've done so far: reduced complexity in failure modes and better user control. To understand that better, you or somebody needs to explain how we would handle the failure modes with "per-standby" settings so we can compare. 
-- Simon Riggs www.2ndQuadrant.com
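For the 4-or-more-server case raised above, the quorum-style proposal would keep the master-side setup to a sketch like this, reusing the parameter names proposed earlier in the thread (none of them implemented):

  # postgresql.conf on the master -- proposed parameters only
  synchronous_replication = 1        # wait for the first ACK, from whichever standby answers
  synch_rep_timeout = 30000          # wait up to 30000ms for that ACK
  synch_rep_timeout_action = commit  # on timeout, commit anyway rather than abort

Adding a fourth or fifth standby would not change these lines; it only changes how many candidates can supply the ACK.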
On Wed, 2010-05-26 at 07:10 -0400, Robert Haas wrote: > OK. In words of one syllable, your way still has all the same knobs, > plus some more. I explained how the per-standby settings would take many parameters, whereas per-transaction settings take far fewer. > You sketched out a design which still had a per-standby setting for > each standby, but IN ADDITION had a setting to control quorum commit[1]. No, you misread it. Again. The parameters were not IN ADDITION - obviously so, otherwise I wouldn't claim there were fewer, would I? Your reply has again avoided the subject of how we would handle failure modes with per-standby settings. That is important. -- Simon Riggs www.2ndQuadrant.com
On Wed, May 26, 2010 at 9:54 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-05-26 at 07:10 -0400, Robert Haas wrote: > >> OK. In words of one syllable, your way still has all the same knobs, >> plus some more. > > I explained how the per-standby settings would take many parameters, > whereas per-transaction settings take far fewer. > >> You sketched out a design which still had a per-standby setting for >> each standby, but IN ADDITION had a setting to control quorum commit[1]. > > No, you misread it. Again. The parameters were not IN ADDITION - > obviously so, otherwise I wouldn't claim there were fewer, would I? Well, that does seem logical, but I can't figure out how to reconcile that with what you wrote before, because as far as I can see you're just saying over and over again that your way will need fewer parameters without explaining which parameters your way won't need. And frankly, I don't think it's possible for quorum commit to reduce the number of parameters. Even if we have that feature available, not everyone will want to use it. And the people who don't will presumably need whatever parameters they would have needed if quorum commit hadn't been available in the first place. > Your reply has again avoided the subject of how we would handle failure > modes with per-standby settings. That is important. I don't think anyone is avoiding that, we just haven't discussed it. The thing is, I don't think quorum commit actually does anything to address that problem. If I have a master and a standby configured for sync rep and the standby goes down, we have to decide what impact that has on the master. If I have a master and two standbys configured for sync rep with quorum commit such that I only need an ack from one of them, and they both go down, we still have to decide what impact that has on the master. I agree we need to talk about it, but I don't agree that putting in quorum commit will remove the need to design that case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 26/05/10 18:31, Robert Haas wrote: > And frankly, I don't think it's possible for quorum commit to reduce > the number of parameters. Even if we have that feature available, not > everyone will want to use it. And the people who don't will > presumably need whatever parameters they would have needed if quorum > commit hadn't been available in the first place. Agreed, quorum commit is not a panacea. For example, suppose that you have two servers, master and a standby, and you want transactions to be synchronously committed to both, so that in the event of a meteor striking the master, you don't lose any transactions that have been reported to the client as committed. Now you want to set up a temporary replica of the master at a development server, for testing purposes. If you set quorum to 2, your development server becomes critical infrastructure, which is not what you want. If you set quorum to 1, it also becomes critical infrastructure, because it's possible that a transaction has been replicated to the test server but not the real production standby, and a meteor strikes. Per-standby settings would let you express that, but OTOH not the quorum behavior where you require N out of M standbys to acknowledge the commit before returning to the client. There's really no limit to how complex a setup can be. For example, imagine that you have two data centers, with two servers in each. You want to replicate the master to all four servers, but for commit to return to the client, it's enough that the transaction has been replicated to one server in each data center. How do you express that in the config file? And it would be nice to have per-transaction control too, like with synchronous_commit... So this is a tradeoff between:
* flexibility: how complex a setup can you express?
* code complexity: how complicated is it to implement?
* user-friendliness: how easy is it to configure?
One way out of this is to implement something very simple in PostgreSQL, and build external WAL proxying tools in pgfoundry that allow you to cascade and disseminate the WAL in as complex scenarios as you want. >> Your reply has again avoided the subject of how we would handle failure >> modes with per-standby settings. That is important. > > I don't think anyone is avoiding that, we just haven't discussed it. > The thing is, I don't think quorum commit actually does anything to > address that problem. If I have a master and a standby configured for > sync rep and the standby goes down, we have to decide what impact that > has on the master. If I have a master and two standbys configured for > sync rep with quorum commit such that I only need an ack from one of > them, and they both go down, we still have to decide what impact that > has on the master. I agree we need to talk about it, but I don't agree > that putting in quorum commit will remove the need to design that > case. Right, failure modes need to be discussed, but how quorum commit or whatnot is configured is irrelevant to that. No-one has come up with a scheme on how to abort a transaction if you don't get a reply from a synchronous standby (or all standbys or a quorum of standbys). Until someone does, a commit on the master will have to always succeed. The "synchronous" aspect will provide a guarantee that if a standby is connected, any transaction in the master will become visible (or fsync'd or just streamed to, depending on the level) on the standby too before it's acknowledged as committed to the client, nothing more, nothing less. 
One way to do that would be to refrain from flushing the commit record to disk on the master until the standby has acknowledged it. The downside is that the master is in a very severe state at that point: until you flush the WAL, you can buffer only a small amount of WAL traffic until you run out of wal_buffers, stalling all write activity in the master, with backends waiting. You can't even shut down the server cleanly. But if you value your transaction integrity much higher than availability, maybe that's what you want. PS. I whole-heartedly agree with Simon's concern upthread that if we allow a standby to specify in its config file that it wants to be a synchronous standby, that's a bit dangerous because connecting such a standby to the master will suddenly make all commits on the master a lot slower. Adding a synchronous standby should require some action in the master, since it affects the behavior on master. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
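For the two-data-center example above, one purely hypothetical way to express "one ACK from each site" would be to group standbys and give each group its own quorum; no patch proposes such syntax, it is sketched here only to show what the configuration would have to be able to say:

  # hypothetical grouping syntax on the master
  standby_group 'dc1' members = 'a1, a2' quorum = 1
  standby_group 'dc2' members = 'b1, b2' quorum = 1
  commit_requires = 'dc1, dc2'    # wait for one ACK from each data center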
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > One way to do that would be to refrain from flushing the commit > record to disk on the master until the standby has acknowledged > it. I'm not clear on the benefit of doing that, versus flushing the commit record and then waiting for responses. Either way some databases will commit before others -- what is the benefit of having the master lag? > Adding a synchronous standby should require some action in the > master, since it affects the behavior on master. +1 -Kevin
On 26/05/10 20:10, Kevin Grittner wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> wrote: > >> One way to do that would be to refrain from flushing the commit >> record to disk on the master until the standby has acknowledged >> it. > > I'm not clear on the benefit of doing that, versus flushing the > commit record and then waiting for responses. Either way some > databases will commit before others -- what is the benefit of having > the master lag? Hmm, I was going to answer that, that way, no other transactions can see the transaction as committed before it has been safely replicated; but I now realize that you could also flush, yet refrain from releasing the entry from procarray until the standby acknowledges the commit, so the transaction would look in-progress to other transactions in the master until then. Although, if the master crashes at that point, and quickly recovers, you could see the last transactions committed on the master before they're replicated to the standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2010-05-26 at 11:31 -0400, Robert Haas wrote: > > Your reply has again avoided the subject of how we would handle failure > > modes with per-standby settings. That is important. > > I don't think anyone is avoiding that, we just haven't discussed it. You haven't discussed it, but even before you do, you know it's better. Not very compelling perspective... > The thing is, I don't think quorum commit actually does anything to > address that problem. If I have a master and a standby configured for > sync rep and the standby goes down, we have to decide what impact that > has on the master. If I have a master and two standbys configured for > sync rep with quorum commit such that I only need an ack from one of > them, and they both go down, we still have to decide what impact that > has on the master. That's already been discussed, and AFAIK Masao and I already agreed on how that would be handled in the quorum commit case. What we haven't had explained is how you would handle all the sub cases or failure modes for the per-standby situation. The most common case for synch rep IMHO is this: * 2 near standbys, 1 remote. Want to be able to ACK to the first near standby that responds, or if both are down, ACK to the remote. I've proposed a way of specifying that with 3 simple parameters, e.g.

  synch_rep_acks = 1
  synch_rep_timeout = 30
  synch_rep_timeout_action = commit

In Oracle this would be all of the following:

* all nodes given unique names
  DB_UNIQUE_NAME=master
  DB_UNIQUE_NAME=near1
  DB_UNIQUE_NAME=near2
  DB_UNIQUE_NAME=remote

* parameter settings
  LOG_ARCHIVE_CONFIG='DG_CONFIG=(master, near1, near2, remote)'
  LOG_ARCHIVE_DEST_2='SERVICE=near1 SYNC AFFIRM NET_TIMEOUT=30 DB_UNIQUE_NAME=near1'
  LOG_ARCHIVE_DEST_STATE_2='ENABLE'
  LOG_ARCHIVE_DEST_3='SERVICE=near2 SYNC AFFIRM NET_TIMEOUT=30 DB_UNIQUE_NAME=near2'
  LOG_ARCHIVE_DEST_STATE_3='ENABLE'
  LOG_ARCHIVE_DEST_4='SERVICE=remote ASYNC NOAFFIRM DB_UNIQUE_NAME=remote'
  LOG_ARCHIVE_DEST_STATE_4='ENABLE'

* modes
  ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;

The Oracle way doesn't allow you to specify that if near1 and near2 are down then we should continue to SYNC via remote, nor does it allow you to specify things from user perspective or at transaction level. You don't need to do it that way, for sure. But we do need to say what way you would pick, rather than just arguing against me before you've even discussed it here or off-list. > I agree we need to talk about it, but I don't agree > that putting in quorum commit will remove the need to design that > case. Yes, you need to design for that case. It's not a magic wand. All I've said is that covering the common cases is easier and more flexible by choosing a transaction-centric style of parameters, and it also allows user-settable behaviour. I want to do better than Oracle, if possible, using lessons learned. I don't want to do the same thing because we're copying them or because we're going down the same conceptual dead end they went down. We should try to avoid doing something obvious and aim a little higher. -- Simon Riggs www.2ndQuadrant.com
On Wed, 2010-05-26 at 12:10 -0500, Kevin Grittner wrote: > > Adding a synchronous standby should require some action in the > > master, since it affects the behavior on master. > > +1 +1 -- Simon Riggs www.2ndQuadrant.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Although, if the master crashes at that point, and quickly > recovers, you could see the last transactions committed on the > master before they're replicated to the standby. Versus having the transaction committed on one or more slaves but not on the master? Unless we have a transaction manager and do proper distributed transactions, how do you avoid edge conditions like that? -Kevin
On Wed, 2010-05-26 at 19:55 +0300, Heikki Linnakangas wrote: > Now you want to set up a temporary replica of the master at a > development server, for testing purposes. If you set quorum to 2, your > development server becomes critical infrastructure, which is not what > you want. That's a good argument for standby relays. Nobody hooks a disposable test machine into a critical production config without expecting it to have some effect. > If you set quorum to 1, it also becomes critical > infrastructure, because it's possible that a transaction has been > replicated to the test server but not the real production standby, and > a meteor strikes. Why would you not want to use the test server? If it's the only thing left protecting you, and you wish to be protected, then it sounds very cool to me. In my proposal this test server only gets data ahead of other things if the "real production standby" responds too slowly. It scares the **** out of people that a DBA can take down a server and suddenly the sync protection you thought you had is turned off. That way of doing things means an application never knows the protection level any piece of data has had. App designers want to be able to mark things "handle with care" or "just do it quick, don't care much". It's a real pain to have to handle all your data the same, and for that to be selectable only by administrators, who may or may not have everything configured correctly/available. -- Simon Riggs www.2ndQuadrant.com
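If the synchronous_replication GUC sketched upthread were USERSET, marking data "handle with care" or "just do it quick" could look like this; the GUC is not implemented, and the table names are invented for the example:

  BEGIN;
  SET LOCAL synchronous_replication = 2;   -- "handle with care": wait for 2 replicas
  UPDATE accounts SET balance = balance - 100 WHERE id = 42;
  COMMIT;

  BEGIN;
  SET LOCAL synchronous_replication = 0;   -- "just do it quick": don't wait at all
  UPDATE sessions SET last_seen = now() WHERE id = 7;
  COMMIT;

SET LOCAL scopes the setting to the transaction, so each commit waits (or doesn't) according to what the application asked for.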
On Wed, May 26, 2010 at 1:24 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 26/05/10 20:10, Kevin Grittner wrote: >> >> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> wrote: >> >>> One way to do that would be to refrain from flushing the commit >>> record to disk on the master until the standby has acknowledged >>> it. >> >> I'm not clear on the benefit of doing that, versus flushing the >> commit record and then waiting for responses. Either way some >> databases will commit before others -- what is the benefit of having >> the master lag? > > Hmm, I was going to answer that, that way, no other transactions can see the > transaction as committed before it has been safely replicated; but I now > realize that you could also flush, yet refrain from releasing the entry from > procarray until the standby acknowledges the commit, so the transaction > would look in-progress to other transactions in the master until then. > > Although, if the master crashes at that point, and quickly recovers, you > could see the last transactions committed on the master before they're > replicated to the standby. No matter what you do, there are going to be corner cases where one node thinks the transaction committed and the other node doesn't know. At any given time, we're either in a state where a crash and restart on the master will replay the commit record, or we're not. And also, but somewhat independently, we're in a state where a crash on the standby will replay the commit record, or we're not. Each of these is dependent on a disk write, and there's no way to guarantee that both of those disk writes succeed or both of them fail. Now, in theory, maybe you could have a system where we don't have a fixed definition of who the master is. If either server crashes or if they lose communication, both crash. If they both come back up, they agree on who has the higher LSN on disk and both roll forward to that point, then designate one server to be the master. If one comes back up and can't reach the other, it appeals to the clusterware for help. The clusterware is then responsible for shooting one node in the head and telling the other node to carry on as the sole survivor. When, eventually, the dead node is resurrected, it *discards* any WAL written after the point from which the new master restarted. Short of that, I don't think "abort the transaction" is a recovery mechanism for when we can't get hold of a standby. We're going to have to commit locally first and then we can decide how long to wait for an ACK that a standby has also committed the same transaction remotely. We can wait not at all, forever, or for a while and then declare the other guy dead. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 26/05/10 20:33, Kevin Grittner wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> wrote: > >> Although, if the master crashes at that point, and quickly >> recovers, you could see the last transactions committed on the >> master before they're replicated to the standby. > > Versus having the transaction committed on one or more slaves but > not on the master? Unless we have a transaction manager and do > proper distributed transactions, how do you avoid edge conditions > like that? Yeah, I guess you can't. You can guarantee that a commit is always safely flushed first in the master, or in the standby, but without two-phase commit you can't guarantee atomicity. It's useful to know which behavior you get, though, so that you can take it into account in your failover procedure. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> Unless we have a transaction manager and do proper distributed >> transactions, how do you avoid edge conditions like that? > > Yeah, I guess you can't. You can guarantee that a commit is > always safely flushed first in the master, or in the standby, but > without two-phase commit you can't guarantee atomicity. It's > useful to know which behavior you get, though, so that you can > take it into account in your failover procedure. It strikes me that if you always write the commit for the master first, there's at least a possibility of developing a heuristic for getting a slave back in sync should the connection break. If you randomly update zero to N slaves and then have a failure, I don't see much hope. -Kevin
On 26/05/10 20:40, Simon Riggs wrote: > On Wed, 2010-05-26 at 19:55 +0300, Heikki Linnakangas wrote: >> If you set quorum to 1, it also becomes critical >> infrastructure, because it's possible that a transaction has been >> replicated to the test server but not the real production standby, and >> a meteor strikes. > > Why would you not want to use the test server? Because your failover procedures know nothing about the test server. Even if the data is there in theory, it'd be completely impractical to fetch it from there. > If it's the only thing > left protecting you, and you wish to be protected, then it sounds very > cool to me. In my proposal this test server only gets data ahead of > other things if the "real production standby" responds too slowly. There are many reasons why a test server could respond faster than the production standby. Maybe the standby is on a different continent. Maybe you have fsync=off on the test server because it's just a test server. Either way, you want the master to ignore it for the purpose of determining if a commit is safe. > It scares the **** out of people that a DBA can take down a server and > suddenly the sync protection you thought you had is turned off. Yeah, it depends on what you're trying to accomplish. If durability is absolutely critical to you (vs. availability), you don't want the commit to ever be acknowledged to the client until it's safely flushed to disk in the standby, even if it means refusing any further commits on the master, until the standby reconnects and catches up. OTOH, if you're not that worried about durability, but you're load balancing queries to the standby, you want to ensure that when you run a query against the standby, a transaction that committed on the master is also visible in the standby. In that scenario, if a standby can't be reached, it is simply pronounced dead, and the master can just ignore it until it reconnects. > That way > of doing things means an application never knows the protection level > any piece of data has had. App designers want to be able to mark things > "handle with care" or "just do it quick, don't care much". Yeah, that's useful too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, May 26, 2010 at 1:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-05-26 at 11:31 -0400, Robert Haas wrote: >> > Your reply has again avoided the subject of how we would handle failure >> > modes with per-standby settings. That is important. >> >> I don't think anyone is avoiding that, we just haven't discussed it. > > You haven't discussed it, but even before you do, you know it's better. > Not very compelling perspective... I don't really understand this comment. I have said, and I believe, that a system without quorum commit is simpler than one with quorum commit. I'd debate the point with you but I find the point so self-evident that I don't even know where to begin arguing it. I am not saying, and I have not said, that we shouldn't have quorum commit. I am saying that it is not something that we need to add as part of an initial sync rep patch, because we can instead add it in a follow-on patch. As far as I can tell, we are not trying to decide between two competing approaches and therefore do not need to decide which one is better. Everything you are proposing sounds useful and valuable. I am not sure whether it handles all of the use cases that folks might have. For example, Heikki mentioned the case upthread of wanting to wait for a commit ACK from one of two servers in data center A and one of two servers in data center B, rather than just two out of four servers total. So, we might need to think a little bit about whether we want to handle those kinds of cases and what sort of infrastructure we would need to support it. But certainly I think quorum commit sounds like a good feature and I hope it will be included in 9.1, or if not 9.1, then some future version. What I don't agree with is that it needs to be part of the initial implementation of sync rep. If there's a case for doing that, I don't believe it's been made on this thread. At any rate, the fact that I don't see it as a sine qua non for sync rep is neither obstructionism nor an ad hominem attack. It's simply an opinion, which I believe to be based on solid technical reasoning, but which I might change my mind about if someone convinces me that I'm looking at the problem the wrong way. That would involve someone making an argument of the following form: "If we don't implement quorum commit in the very first implementation of sync rep, then it will be hard to add later because X." So far no one has done that. You have made a similar argument "If we do implement quorum commit in the very first version of sync rep, it will save implementation work elsewhere" - but I don't think that's true and I have explained why. >> The thing is, I don't think quorum commit actually does anything to >> address that problem. If I have a master and a standby configured for >> sync rep and the standby goes down, we have to decide what impact that >> has on the master. If I have a master and two standbys configured for >> sync rep with quorum commit such that I only need an ack from one of >> them, and they both go down, we still have to decide what impact that >> has on the master. > > That's already been discussed, and AFAIK Masao and I already agreed on > how that would be handled in the quorum commit case. I can't find that in the thread. Anyway, again, you're probably going to want the same options there that you will in the "master with one standby" case, which I personally think is going to be a LOT more common than any other configuration. 
> [configuration example] > The Oracle way doesn't allow you to specify that if near1 and near2 are > down then we should continue to SYNC via remote, nor does it allow you > to specify things from user perspective or at transaction level. > > You don't need to do it that way, for sure. But we do need to say what > way you would pick, rather than just arguing against me before you've > even discussed it here or off-list. Well, again, I am not arguing and have not argued that we shouldn't do it, just that we shouldn't do it UNTIL we get the basic stuff working. On the substance of the design, the Oracle way doesn't look that bad in terms of syntax (I suspect we'll end up with some of the same knobs), but certainly I agree that it would be nice to do some of the things they can't which you have detailed here. I just don't want us to bite off more than we can chew. Then we might end up with nothing, which would suck. >> I agree we need to talk about it, but I don't agree >> that putting in quorum commit will remove the need to design that >> case. > > Yes, you need to design for that case. It's not a magic wand. > > All I've said is that covering the common cases is easier and more > flexible by choosing a transaction-centric style of parameters, and it > also allows user-settable behaviour. One of the ideas you proposed upthread, in terms of transaction-centric behavior, is having an individual transaction be able to ask for a weaker integrity guarantee than whatever the default is. I think that is both a very good idea and probably something we should implement relatively early on - though still maybe not in the first patch. I think there are a lot of people who will have SOME transactions (user transfers money to other user) that absolutely have to be durable and other transactions (user logs in) that we can risk losing in the event of a crash. > I want to do better than Oracle, if possible, using lessons learned. I > don't want to do the same thing because we're copying them or because > we're going down the same conceptual dead end they went down. We should > try to avoid doing something obvious and aim a little higher. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-05-26 at 14:30 -0400, Robert Haas wrote: > On Wed, May 26, 2010 at 1:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Wed, 2010-05-26 at 11:31 -0400, Robert Haas wrote: > >> > Your reply has again avoided the subject of how we would handle failure > >> > modes with per-standby settings. That is important. > >> > >> I don't think anyone is avoiding that, we just haven't discussed it. > > > > You haven't discussed it, but even before you do, you know it's better. > > Not very compelling perspective... > > I don't really understand this comment. I have said, and I believe, > that a system without quorum commit is simpler than one with quorum > commit. I'd debate the point with you but I find the point so > self-evident that I don't even know where to begin arguing it. > It's simply an opinion, which I believe to > be based on solid technical reasoning, but which I might change my > mind about if someone convinces me that I'm looking at the problem the > wrong way. You're saying you have solid technical reasons, but they are so self-evident that you can't even begin to argue them. Why are you so sure your reasons are solid?? Regrettably, I say this doesn't make any sense, however much you write. The decision may already have been made in your eyes, but the community still has options as to how to proceed, whether or not Masao has already written this. Zoltan has already presented a patch that follows my proposal, so there are alternate valid paths which we can decide between. It's not a matter of opinion as to which is easier to code cos it's already done; you can run the patch and see. (No comment on other parts of that patch). The alternative is an approach that hasn't even been presented fully on list, with many unanswered questions. I've thought about this myself and discussed my reasons on list for the past two years. If you can read all I've presented to the community and come up with a better way, great, we'll all be happy. -- Simon Riggs www.2ndQuadrant.com
On Wed, May 26, 2010 at 3:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> I don't really understand this comment. I have said, and I believe, >> that a system without quorum commit is simpler than one with quorum >> commit. I'd debate the point with you but I find the point so >> self-evident that I don't even know where to begin arguing it. > >> It's simply an opinion, which I believe to >> be based on solid technical reasoning, but which I might change my >> mind about if someone convinces me that I'm looking at the problem the >> wrong way. > > You're saying you have solid technical reasons, but they are so > self-evident that you can't even begin to argue them. Why are you so > sure your reasons are solid?? Regrettably, I say this doesn't make any > sense, however much you write. Yeah, especially when you juxtapose two different parts of my email that were talking about different things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-05-26 at 15:37 -0400, Robert Haas wrote: > On Wed, May 26, 2010 at 3:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> I don't really understand this comment. I have said, and I believe, > >> that a system without quorum commit is simpler than one with quorum > >> commit. I'd debate the point with you but I find the point so > >> self-evident that I don't even know where to begin arguing it. > > > >> It's simply an opinion, which I believe to > >> be based on solid technical reasoning, but which I might change my > >> mind about if someone convinces me that I'm looking at the problem the > >> wrong way. > > > > You're saying you have solid technical reasons, but they are so > > self-evident that you can't even begin to argue them. Why are you so > > sure your reasons are solid?? Regrettably, I say this doesn't make any > > sense, however much you write. > > Yeah, especially when you juxtapose two different parts of my email > that were talking about different things. /me rings bell and orders the fighters to their respective corners. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise Postgres Company > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
Simon Riggs <simon@2ndQuadrant.com> writes: > On Wed, 2010-05-26 at 19:55 +0300, Heikki Linnakangas wrote: >> Now you want to set up a temporary replica of the master at a >> development server, for testing purposes. If you set quorum to 2, your >> development server becomes critical infrastructure, which is not what >> you want. > > That's a good argument for standby relays. Well it seems to me we can have the best of both worlds as soon as we have cascading support. Even in the test server example, this one would be a slave of the main slave, not counted into the quorum on the master. Now it's the quorum on the slave that would be deciding on the availability of the test server. Set it down to 0 and your test server has no impact on the production environment. In the example of one master and 4 slaves in 2 different locations, you'll have a quorum of 2 on the master, which will know about 2 slaves only. And each of them will have 1 slave, with a quorum set to 0 or 1 depending on what you want to achieve. So if you want simplicity to admin, effective data availability and precise control over the global setup, I say go for:
a. transaction level control of the replication level
b. cascading support
c. quorum with timeout
d. choice of commit or rollback at timeout
Then give me a setup example that you can't express fully. As far as the options to control the whole thing are concerned, I think that the cascading support does not add any. So that's 3 GUCs. Regards, -- dim
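A sketch of the test-server example under this scheme, with the quorum GUCs still hypothetical: the master waits only for its production slave, and the slave's own quorum of 0 keeps the cascaded test server out of the critical path:

  # master: postgresql.conf (proposed parameters, not implemented)
  quorum = 1           # wait for the production slave only
  quorum_timeout = 30s

  # production slave: postgresql.conf (proposed parameters)
  quorum = 0           # never wait for the cascaded test server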
On 26/05/10 23:31, Dimitri Fontaine wrote: > d. choice of commit or rollback at timeout Rollback is not an option. There is no going back after the commit record has been flushed to disk or sent to a standby. The choice is to either commit anyway after the timeout, or wait forever. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 5/26/2010 12:55 PM, Heikki Linnakangas wrote: > On 26/05/10 18:31, Robert Haas wrote: >> And frankly, I don't think it's possible for quorum commit to reduce >> the number of parameters. Even if we have that feature available, not >> everyone will want to use it. And the people who don't will >> presumably need whatever parameters they would have needed if quorum >> commit hadn't been available in the first place. > > Agreed, quorum commit is not a panacea. > > For example, suppose that you have two servers, master and a standby, > and you want transactions to be synchronously committed to both, so that > in the event of a meteor striking the master, you don't lose any > transactions that have been reported to the client as committed. > > Now you want to set up a temporary replica of the master at a > development server, for testing purposes. If you set quorum to 2, your > development server becomes critical infrastructure, which is not what > you want. If you set quorum to 1, it also becomes critical > infrastructure, because it's possible that a transaction has been > replicated to the test server but not the real production standby, and a > meteor strikes. > > Per-standby settings would let you express that, but OTOH not the quorum > behavior where you require N out of M standbys to acknowledge the commit > before returning to the client. You can do this only with per-standby options, by giving each standby a weight, or a number of votes. Your DEV server would have a weight of zero, while your production standbys have higher weights, depending on their importance for your overall infrastructure. As long as majority means >50% of all votes in the house, you don't have a split brain risk. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
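As an arithmetic illustration of the voting idea, with invented syntax: weights of 2, 2 and 0 give 4 votes in the house, so a majority (>50%, i.e. at least 3 votes) requires ACKs from both production standbys, and the DEV server can never tip the result:

  # hypothetical per-standby vote weights on the master
  standby 'prod1' votes = 2
  standby 'prod2' votes = 2
  standby 'dev'   votes = 0   # can be attached or detached with no effect on commits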
On 26/05/10 23:31, Dimitri Fontaine wrote: > So if you want simplicity to admin, effective data availability and > precise control over the global setup, I say go for: > a. transaction level control of the replication level > b. cascading support > c. quorum with timeout > d. choice of commit or rollback at timeout > > Then give me a setup example that you can't express fully. One master, one synchronous standby on another continent for HA purposes, and one asynchronous reporting server in the same rack as the master. You don't want to set up the reporting server as a cascaded slave of the standby on the other continent, because that would double the bandwidth required, but you also don't want the master to wait for the reporting server. The possibilities are endless... Your proposal above covers a pretty good set of scenarios, but it's by no means complete. If we try to solve everything the configuration will need to be written in a Turing-complete Replication Description Language. We'll have to pick a useful, easy-to-understand subset that covers the common scenarios. To handle the more exotic scenarios, you can write a proxy that sits in front of the master, and implements whatever rules you wish, with the rules written in C. BTW, I think we're going to need a separate config file for listing the standbys anyway. There you can write per-server rules and options, but explicitly knowing about all the standbys also allows the master to recycle WAL as soon as it has been streamed to all the registered standbys. Currently we just keep wal_keep_segments files around, just in case there's a standby out there that needs them. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
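A sketch of what such a standby-listing file might contain, reusing the continent/rack example above; the syntax is invented, the point being only that the master knows the complete set of standbys:

  # standbys.conf on the master (hypothetical)
  standby 'ha'      host=standby.example.com mode=sync  level=flush   # HA standby, other continent
  standby 'reports' host=10.0.0.3            mode=async               # reporting server, same rack
  # WAL segments can be recycled once both registered standbys have received them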
On Wed, 2010-05-26 at 17:31 -0400, Jan Wieck wrote: > You can do this only with per-standby options, by giving each standby a > weight, or a number of votes. Your DEV server would have a weight of > zero, while your production standbys have higher weights, depending on > their importance for your overall infrastructure. As long as majority > means >50% of all votes in the house, you don't have a split brain risk. Yes, you could do that with per-standby options. If you give each standby a weight then the parameter has much less meaning for the user. It doesn't mean number of replicas any more, it means something else with local and changeable meaning. A fractional quorum suffers the same way. What would make some sense would be to have an option for "vote=0|1" so that a standby would never take part in the transaction sync when vote=0. But you still have the problem of specifying rules if insufficient servers with vote=1 are available. The reaction to that is to supply more servers with vote=1, though then you need a way to specify how many servers with vote=1 you care about. -- Simon Riggs www.2ndQuadrant.com
On Thu, 2010-05-27 at 00:21 +0300, Heikki Linnakangas wrote: > On 26/05/10 23:31, Dimitri Fontaine wrote: > > d. choice of commit or rollback at timeout > > Rollback is not an option. There is no going back after the commit > record has been flushed to disk or sent to a standby. There's definitely no going back after the xid has been removed from procarray because other transactions will then depend upon the final state. Currently we PANIC if we abort after we've marked clog, though that happens after XLogFlush(), which is where we're planning to wait for synch rep. If we abort after having written a commit record to disk we can still successfully generate an abort record as well. (Luckily, I note HS does actually cope with that. Phew!) So actually, an abort is a reasonable possibility, though I know it doesn't sound like it could be at first thought. > The choice is to either commit anyway after the timeout, or wait forever. Hmm, wait forever. What happens if we try to shutdown fast while there is a transaction that is waiting forever? Is that then a commit, even though it never made it to the standby? How would we know it was safe to switchover or not? Hmm. Oracle offers options of COMMIT | SHUTDOWN in this case. -- Simon Riggs www.2ndQuadrant.com
On 27/05/10 01:23, Simon Riggs wrote: > On Thu, 2010-05-27 at 00:21 +0300, Heikki Linnakangas wrote: >> On 26/05/10 23:31, Dimitri Fontaine wrote: >>> d. choice of commit or rollback at timeout >> >> Rollback is not an option. There is no going back after the commit >> record has been flushed to disk or sent to a standby. > > There's definitely no going back after the xid has been removed from > procarray because other transactions will then depend upon the final > state. Currently we PANIC if we abort after we've marked clog, though > that happens after XLogFlush(), which is where we're planning to wait > for synch rep. If we abort after having written a commit record to disk > we can still successfully generate an abort record as well. (Luckily, I > note HS does actually cope with that. Phew!) > > So actually, an abort is a reasonable possibility, though I know it > doesn't sound like it could be at first thought. Hmm, that's an interesting thought. Interesting, as in crazy ;-). I don't understand how HS could handle that. As soon as it sees the commit record, the transaction becomes visible to readers. >> The choice is to either commit anyway after the timeout, or wait forever. > > Hmm, wait forever. What happens if we try to shutdown fast while there > is a transaction that is waiting forever? Is that then a commit, even > though it never made it to the standby? How would we know it was safe to > switchover or not? Hmm. Refuse to shut down until the standby acknowledges the commit. That's the only way to be sure.. In practice, hard synchronous "don't return ever until the commit hits the standby" behavior is rarely what admins actually want, because it's disastrous from an availability point of view. More likely, admins want "wait for ack from standby, unless it's not responding, in which case to hell with redundancy and just act like a single server". It makes sense if you just want to make sure that the standby doesn't return stale results when it's working properly, and you're not worried about durability but I'm not sure it's very sound otherwise. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
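The two behaviours described here map onto the timeout parameters discussed upthread; a sketch using the proposed (unimplemented) names, where the value for "wait forever" is likewise invented:

  # durability first: never acknowledge a commit the standby hasn't got
  synch_rep_timeout_action = wait      # hypothetical value: block commits, refuse clean shutdown

  # availability first: pronounce a silent standby dead and carry on alone
  synch_rep_timeout = 30000
  synch_rep_timeout_action = commit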
On Wed, May 26, 2010 at 10:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote: > >> I don't think that dropping support for #3 reduces complexity, >> since the code for #3 is almost the same as that for #2. Just as >> walreceiver sends the ACK after receiving the WAL in the #2 case, it >> only has to do the same thing after the WAL flush. > > Hmm, well the code for #3 is also similar to the code for #4. So if you > do #2, it's easy to do #2, #3 and #4 together. No. #4 requires prompt communication between walreceiver and the startup process, but #2 and #3 do not. That is, in #4, walreceiver has to wake the startup process up as soon as it has flushed WAL. OTOH, the startup process has to wake walreceiver up as soon as it has replayed WAL, to request it to send the ACK to the master. In #2 and #3, prompt communication from walreceiver to the startup process, i.e., changing the poll loop in the startup process, would also be useful for making the data visible immediately on the standby. But it's not required. > The comment is about whether having #3 makes sense from a user interface > perspective. It's easy to add options, but they must have useful > meaning. #3 would be useful for people wanting further robustness. In #2, if a simultaneous power failure on the master and the standby coincides with a disk crash on the master, a transaction whose "success" indicator has been returned to a client might be lost. #3 can avoid such a critical situation. This is one of the reasons that DRBD supports "Protocol C", I think. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2010-05-27 at 11:28 +0900, Fujii Masao wrote: > On Wed, May 26, 2010 at 10:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote: > > > >> I don't think that dropping support for #3 reduces complexity, > >> since the code for #3 is almost the same as that for #2. Just as > >> walreceiver sends the ACK after receiving the WAL in the #2 case, it > >> only has to do the same thing after the WAL flush. > > > > Hmm, well the code for #3 is also similar to the code for #4. So if you > > do #2, it's easy to do #2, #3 and #4 together. > > No. #4 requires prompt communication between walreceiver and the > startup process, but #2 and #3 do not. That is, in #4, walreceiver has to > wake the startup process up as soon as it has flushed WAL. OTOH, the > startup process has to wake walreceiver up as soon as it has replayed > WAL, to request it to send the ACK to the master. In #2 and #3, prompt > communication from walreceiver to the startup process, i.e., changing > the poll loop in the startup process, would also be useful for making the > data visible immediately on the standby. But it's not required. You need to pass WAL promptly from backend to WALSender on the primary. Whatever mechanism you use can also be reused symmetrically on the standby to provide #4. So not a problem. > > The comment is about whether having #3 makes sense from a user interface > > perspective. It's easy to add options, but they must have useful > > meaning. > > #3 would be useful for people wanting further robustness. In #2, > if a simultaneous power failure on the master and the standby coincides > with a disk crash on the master, a transaction whose "success" indicator > has been returned to a client might be lost. > #3 can avoid such a critical situation. This is one of the reasons that > DRBD supports "Protocol C", I think. Few people use it, and if they do, it's because DRBD didn't originally support multiple standbys. Not worth emulating IMHO. -- Simon Riggs www.2ndQuadrant.com
On Thu, 2010-05-27 at 02:18 +0300, Heikki Linnakangas wrote: > On 27/05/10 01:23, Simon Riggs wrote: > > On Thu, 2010-05-27 at 00:21 +0300, Heikki Linnakangas wrote: > >> On 26/05/10 23:31, Dimitri Fontaine wrote: > >>> d. choice of commit or rollback at timeout > >> > >> Rollback is not an option. There is no going back after the commit > >> record has been flushed to disk or sent to a standby. > > > > There's definitely no going back after the xid has been removed from > > procarray because other transactions will then depend upon the final > > state. Currently we PANIC if we abort after we've marked clog, though > > that happens after XLogFlush(), which is where we're planning to wait > > for synch rep. If we abort after having written a commit record to disk > > we can still successfully generate an abort record as well. (Luckily, I > > note HS does actually cope with that. Phew!) > > > > So actually, an abort is a reasonable possibility, though I know it > > doesn't sound like it could be at first thought. > Hmm, that's an interesting thought. Interesting, as in crazy ;-). :-) It's a surprising thought for me also. > I don't understand how HS could handle that. As soon as it sees the > commit record, the transaction becomes visible to readers. I meant not-barf completely. > >> The choice is to either commit anyway after the timeout, or wait forever. > > > > Hmm, wait forever. What happens if we try to shutdown fast while there > > is a transaction that is waiting forever? Is that then a commit, even > > though it never made it to the standby? How would we know it was safe to > > switchover or not? Hmm. > > Refuse to shut down until the standby acknowledges the commit. That's > the only way to be sure.. > > In practice, hard synchronous "don't return ever until the commit hits > the standby" behavior is rarely what admins actually want, because it's > disastrous from an availability point of view. More likely, admins want > "wait for ack from standby, unless it's not responding, in which case to > hell with redundancy and just act like a single server". It makes sense > if you just want to make sure that the standby doesn't return stale > results when it's working properly, and you're not worried about > durability but I'm not sure it's very sound otherwise. Which is also crazy. If you're using synch rep it's because you care deeply about durability. Some people wish to treat the COMMIT as a guarantee, not just a shrug. I agree that don't-return-ever isn't something anyone will want. What we need is a "COMMIT with ERROR" message! Note that Oracle gives the options of COMMIT | SHUTDOWN at this point. Shutdown is an implicit abort for the writing transaction... At this point the primary thinks the standby is no longer available. If we have a split brain situation then we should be assuming we will STONITH and shut down the primary anyway. If we have more than one standby we can stay up and probably shouldn't be sending an abort after a commit. The trouble is *every* option is crazy from some perspective, so we must consider them all, to see whether they are practical or impractical. -- Simon Riggs www.2ndQuadrant.com
On 27/05/10 09:51, Simon Riggs wrote: > On Thu, 2010-05-27 at 02:18 +0300, Heikki Linnakangas wrote: >> In practice, hard synchronous "don't return ever until the commit hits >> the standby" behavior is rarely what admins actually want, because it's >> disastrous from an availability point of view. More likely, admins want >> "wait for ack from standby, unless it's not responding, in which case to >> hell with redundancy and just act like a single server". It makes sense >> if you just want to make sure that the standby doesn't return stale >> results when it's working properly, and you're not worried about >> durability but I'm not sure it's very sound otherwise. > > Which is also crazy. If you're using synch rep it's because you care > deeply about durability. No, not necessarily. As I said above, you might just want a guarantee that *if* you query the standby, you get up-to-date results. But if the standby is down for any reason, you don't care about it. That's a very sensible mode of operation, for example if you're offloading reads to the standby with something like pgpool. In fact I have the feeling that that's the most common use case for synchronous replication, not a deep concern about durability. > I agree that don't-return-ever isn't something anyone will want. > > What we need is a "COMMIT with ERROR" message! Hmm, perhaps we could emit a warning with the commit. I'm not sure what an application could do with it, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, May 26, 2010 at 10:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > If the remote server responded first, then that proves it is a better > candidate for failover than the one you think of as near. If the two > standbys vary over time then you have network problems that will > directly affect the performance on the master; synch_rep = N would > respond better to any such problems.

No. The remote standby might respond first temporarily even though it's usually behind the near one. Read-only queries or an incrementally updated backup operation might cause bursty disk writes and delay the ACK from the standby. Lock contention between read-only queries and recovery would delay the ACK. So the standby which responds first is not always the best candidate for failover. Also, the administrator generally doesn't put the remote standby under the control of clusterware like heartbeat. In this case, the remote standby will never be the candidate for failover. But quorum commit cannot cover this simple case.

>> OTOH, "synchronous_replication=2" degrades the >> performance on the master very much. > > Yes, but only because you have only one near standby. It would clearly > to be foolish to make this setting without 2+ near standbys. We would > then have 4 or more servers; how do we specify everything for that > config??

If you always want to use the near standby as the candidate for failover by using quorum commit in the above simple case, you would need to choose such a foolish setting. Otherwise, unfortunately, you might have to fail over to the remote standby not under the control of clusterware.

>> "synchronous_replication" approach >> doesn't seem to cover the typical use case. > > You described the failure modes for the quorum proposal, but avoided > describing the failure modes for the "per-standby" proposal. > > Please explain what will happen when the near server is unavailable, > with per-standby settings. Please also explain what will happen if we > choose to have 4 or 5 servers to maintain performance in case of the > near server going down. How will we specify the failure modes?

I'll try to explain that.

(1) most standard case: 1 master + 1 "sync" standby (near)

When the master goes down, something like clusterware detects that failure and brings the standby online. Since we can ensure that the standby has all the committed transactions, failover doesn't cause any data loss.

When the standby goes down or a network outage happens, walsender detects that failure via the replication timeout, keepalive, or an error return from the system calls. Then walsender does something according to the specified reaction (GUC) to the failure of the standby, e.g., walsender wakes the transaction commit up from the wait-for-ACK, and exits. Then the master runs standalone.

(2) 1 master + 1 "sync" standby (near) + 1 "async" standby (remote)

When the master goes down, something like clusterware brings the "sync" standby in the near location online. The administrator would need to take a fresh base backup of the new master, load it on the remote standby, change the primary_conninfo, and restart the remote standby.

When one of the standbys goes down, walsender does the same thing described in (1). Until the failed standby has restarted, the master runs together with another standby.

In (1) and (2), after some failure happens, there would be only one server which is guaranteed to have all the committed transactions. When it also goes down, the database service stops.
If you want to avoid this fragile situation, you would need to add one more "sync" standby at the near site.

(3) 1 master + 2 "sync" standbys (near) + 1 "async" standby (remote)

When the master goes down, something like clusterware brings one of the "sync" standbys online by using some selection algorithm. The administrator would need to take a fresh base backup of the new master, load it on both remaining standbys, change the primary_conninfo, and restart them.

When one of the standbys goes down, walsender does the same thing described in (1). Until the failed standby has restarted, the master runs together with two standbys. At least one standby is guaranteed to be in sync with the master.

Is this explanation enough?

>> Also, when "synchronous_replication=1" and one of synchronous standbys >> goes down, how should the surviving standby catch up with the master? >> Such standby might be too far behind the master. The transaction commit >> should wait for the ACK from the lagging standby immediately even if >> there might be large gap? If yes, "synch_rep_timeout" would screw up >> the replication easily. > > That depends upon whether we send the ACK at point #2, #3 or #4. It > would only cause a problem if you waited until #4.

Yeah, the problem happens. If we implement quorum commit, we need to design how the surviving standby catches up with the master.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
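To make scenario (2) concrete, here is a minimal sketch of what the per-standby configuration could look like. The parameter names replication_mode and synch_rep_timeout_action are invented placeholders for this discussion, not settled syntax; only synch_rep_timeout has been mentioned so far as a possible GUC.

    # recovery.conf on the near "sync" standby (hypothetical syntax)
    primary_conninfo = 'host=master port=5432'
    replication_mode = 'fsync'          # level #3: ACK after WAL is flushed

    # recovery.conf on the remote "async" standby
    primary_conninfo = 'host=master port=5432'
    replication_mode = 'async'          # level #1: no ACK, commits never wait

    # postgresql.conf on the master
    synch_rep_timeout = 30              # seconds to wait for an ACK
    synch_rep_timeout_action = 'commit' # run standalone if the standby is gone

With settings shaped like these, the failure reactions described above fall out of the master-side timeout knobs rather than any quorum calculation.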
On Thu, May 27, 2010 at 3:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, 2010-05-27 at 11:28 +0900, Fujii Masao wrote: >> On Wed, May 26, 2010 at 10:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> > On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote: >> > >> >> I guess that dropping the support of #3 doesn't reduce complexity >> >> since the code of #3 is almost the same as that of #2. Like >> >> walreceiver sends the ACK after receiving the WAL in #2 case, it has >> >> only to do the same thing after the WAL flush. >> > >> > Hmm, well the code for #3 is similar also to the code for #4. So if you >> > do #2, its easy to do #2, #3 and #4 together. >> >> No. #4 requires the way of prompt communication between walreceiver and >> startup process, but #2 and #3 not. That is, in #4, walreceiver has to >> wake the startup process up as soon as it has flushed WAL. OTOH, the >> startup process has to wake walreceiver up as soon as it has replayed >> WAL, to request it to send the ACK to the master. In #2 and #3, the >> prompt communication from walreceiver to startup process, i.e., changing >> the poll loop in the startup process would also be useful for the data >> to be visible immediately on the standby. But it's not required. > > You need to pass WAL promptly on primary from backend to WALSender. > Whatever mechanism you use can also be reused symmetrically on standby > to provide #4. So not a problem.

I cannot be so optimistic since the situation differs from one process to another.

>> > The comment is about whether having #3 makes sense from a user interface >> > perspective. It's easy to add options, but they must have useful >> > meaning. >> >> #3 would be useful for people wanting further robustness. In #2, >> when simultaneous power failure on the master and the standby, >> and concurrent disk crash on the master happen, transaction whose >> "success" indicator has been returned to a client might be lost. >> #3 can avoid such a critical situation. This is one of reasons that >> DRBD supports "Protocol C", I think. > > Which few people use it, or if they do its because DRBD didn't > originally support multiple standbys. Not worth emulating IMHO.

If so, #3 would be useful for people who can't afford to buy more than one standby server, too :)

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 26/05/10 23:31, Dimitri Fontaine wrote: >> So if you want simplicity to admin, effective data availability and >> precise control over the global setup, I say go for: >> a. transaction level control of the replication level >> b. cascading support >> c. quorum with timeout >> d. choice of commit or rollback at timeout >> >> Then give me a setup example that you can't express fully. > > One master, one synchronous standby on another continent for HA purposes, > and one asynchronous reporting server in the same rack as the master. You > don't want to set up the reporting server as a cascaded slave of the standby > on the other continent, because that would double the bandwidth required, > but you also don't want the master to wait for the reporting server. > > The possibilities are endless... Your proposal above covers a pretty good > set of scenarios, but it's by no means complete. If we try to solve > everything the configuration will need to be written in a Turing-complete > Replication Description Language. We'll have to pick a useful, > easy-to-understand subset that covers the common scenarios. To handle the > more exotic scenarios, you can write a proxy that sits in front of the > master, and implements whatever rules you wish, with the rules written > in C.

Agreed on the Turing-completeness side of those things. My current thinking is that the proxy I want might simply be a PostgreSQL instance with cascading support. In your example that would give us:

                        Remote Standby, HA
    Master -- Proxy -<
                        Local Standby, Reporting

So what I think we have here is a pretty good trade-off in terms of what you can do with some simple setup knobs. What's left there is that with the quorum idea, you're not sure if the one server that's synced is the remote or local standby, in this example. Several ideas are floating around (votes, mixed per-standby and per-transaction settings). Maybe we could have the standby be able to say it's not interested in participating in the quorum, that is, it's an async replica, full stop.

In your example we'd set the local reporting standby as a non-voting member of the replication setting, the proxy and the master would have a quorum of 1, and the remote HA standby would vote. I don't think the idea of having any number of voting coupons other than 0 or 1 on any server will help us in the least.

I do think that your proxy idea is a great one and should be in core. By the way, the cascading/proxy instance could be set up without Hot Standby, if you don't need to be able to monitor it via a libpq connection and some queries.

> BTW, I think we're going to need a separate config file for listing the > standbys anyway. There you can write per-server rules and options, but > explicitly knowing about all the standbys also allows the master to recycle > WAL as soon as it has been streamed to all the registered > standbys. Currently we just keep wal_keep_segments files around, just in > case there's a standby out there that needs them.

I much prefer that each server in the set publish what it wants. It only connects to 1 given provider. Then we've been talking about this exact same retention problem for queueing solutions, with Jan, Marko and Jim. The idea we came up with is a watermarking solution (which already exists in Skytools 3, in its coarse-grain version). The first approach is to have each slave give back to its local master/provider/origin the last replayed WAL/LSN, once in a while.
You derive from that a global watermark and drop WAL files depending on it. You now have two problems: running out of space, and why keep that many files on the master anyway; maybe some slave could be set up for retention instead?

To solve that it's possible for each server to be set up with a restricted set of servers they're deriving their watermark from. That's when you need per-server options and an explicit list of all the standbys whatever their level in the cascading tree. That means explicit maintenance of the entire replication topology.

I don't think we need to solve that yet. I think we need to provide an option on each member of the replication tree to either PANIC or lose WALs in case they're running out of space when trying to follow the watermark. It's crude but it already allows having a standby set up to maintain the common archive and have the master drop the WAL files as soon as possible (respecting wal_keep_segments). In our case, if a WAL file is no longer available from any active server we still have the option to fetch it from the archives...

Regards, -- Dimitri Fontaine PostgreSQL DBA, Architecte
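A sketch of that watermark computation in C: each standby periodically reports the last WAL position it has replayed, and anything older than the minimum over the counted standbys (and the local flush point) can be recycled. The types and names are illustrative assumptions only; PostgreSQL's real XLogRecPtr is a two-field struct in this era, simplified here to an integer.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* simplified stand-in */

    typedef struct StandbyFeedback
    {
        bool        counted;    /* may this standby hold back WAL recycling? */
        XLogRecPtr  replayed;   /* last WAL position this standby replayed   */
    } StandbyFeedback;

    /* The oldest WAL position any counted standby still needs. */
    static XLogRecPtr
    global_watermark(const StandbyFeedback *standbys, int n,
                     XLogRecPtr local_flush)
    {
        XLogRecPtr  wm = local_flush;

        for (int i = 0; i < n; i++)
            if (standbys[i].counted && standbys[i].replayed < wm)
                wm = standbys[i].replayed;

        return wm;      /* segments wholly before wm can be recycled */
    }

A standby with counted = false matches the "lose WALs" option: the master recycles regardless, and that standby falls back to the archive when it lags.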
On Thu, 2010-05-27 at 10:09 +0300, Heikki Linnakangas wrote: > No, not necessarily. As I said above, you might just want a guarantee > that *if* you query the standby, you get up-to-date results. Of course. COMMIT was already one of the options, so this comment was already understood. What we are discussing is whether additional options exist and/or are desirable. We should not be forcing everybody to COMMIT whether or not it is robust. -- Simon Riggs www.2ndQuadrant.com
On Thu, 2010-05-27 at 16:35 +0900, Fujii Masao wrote: > On Thu, May 27, 2010 at 3:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Thu, 2010-05-27 at 11:28 +0900, Fujii Masao wrote: > >> On Wed, May 26, 2010 at 10:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> > On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote: > >> > > >> >> I guess that dropping the support of #3 doesn't reduce complexity > >> >> since the code of #3 is almost the same as that of #2. Like > >> >> walreceiver sends the ACK after receiving the WAL in #2 case, it has > >> >> only to do the same thing after the WAL flush. > >> > > >> > Hmm, well the code for #3 is similar also to the code for #4. So if you > >> > do #2, its easy to do #2, #3 and #4 together. > >> > >> No. #4 requires the way of prompt communication between walreceiver and > >> startup process, but #2 and #3 not. That is, in #4, walreceiver has to > >> wake the startup process up as soon as it has flushed WAL. OTOH, the > >> startup process has to wake walreceiver up as soon as it has replayed > >> WAL, to request it to send the ACK to the master. In #2 and #3, the > >> prompt communication from walreceiver to startup process, i.e., changing > >> the poll loop in the startup process would also be useful for the data > >> to be visible immediately on the standby. But it's not required. > > > > You need to pass WAL promptly on primary from backend to WALSender. > > Whatever mechanism you use can also be reused symmetrically on standby > > to provide #4. So not a problem. > > I cannot be so optimistic since the situation differs from one process > to another.

This spurs some architectural thinking: I think we need to disconnect the idea of waiting in any of the components. Anytime we ask WALSender or WALReceiver to wait for acknowledgement we will be reducing throughput. So we should assume that they will continue to work as quickly as possible.

The acknowledgement from the standby can contain the latest xlog location of WAL received, WAL written to disk and WAL applied, all by reading values from shared memory. It's all the same, whether we send back 2 or 3 xlog locations in the ack message.

Who sends the ack message? Who receives it? Would it be easier to have this happen in a second pair of processes, WALSynchroniser (on primary) and WALAcknowledger (on standby)? WALAcknowledger would send back a stream of ack messages with latest xlog positions. WALSynchroniser would receive these messages and wake up sleeping backends. If we did that then there'd be almost no change at all to existing code, just additional code and processes for the sync case. Code would be separate and there would be no performance concerns either.

Backends can then choose to wait until the xlog location they need has been reached, which might be in the next acknowledgement message or in a subsequent one. That also ensures that the logic for this is completely on the master and the standby doesn't act differently, apart from needing to start a WALAcknowledger process if sync rep is requested.

If you do choose to make #3 important, then I'd say you need to work out how to make WALWriter active as well, so it can perform regular fsyncs, rather than having WALReceiver wait across that I/O. -- Simon Riggs www.2ndQuadrant.com
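As a sketch of that acknowledgement stream: each message carries the three WAL positions, read from shared memory on the standby, and WALSynchroniser compares a waiting backend's commit LSN against the field matching its requested level. The struct and function below are invented for illustration; nothing like them exists in the tree yet.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* simplified stand-in */

    typedef struct WalAckMessage
    {
        XLogRecPtr  received;   /* newest WAL received into memory (#2) */
        XLogRecPtr  flushed;    /* newest WAL fsync'd to disk      (#3) */
        XLogRecPtr  applied;    /* newest WAL replayed             (#4) */
    } WalAckMessage;

    /* Does this ack satisfy a backend waiting on commit_lsn at level? */
    static bool
    ack_satisfies(const WalAckMessage *ack, XLogRecPtr commit_lsn, int level)
    {
        switch (level)
        {
            case 2:  return ack->received >= commit_lsn;
            case 3:  return ack->flushed  >= commit_lsn;
            case 4:  return ack->applied  >= commit_lsn;
            default: return true;   /* level #1 (async): never waits */
        }
    }

Sending all three positions in every message means a single ack stream serves #2, #3 and #4 waiters at once, which is the "it's all the same" point above.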
On Thu, 2010-05-27 at 16:13 +0900, Fujii Masao wrote: > On Wed, May 26, 2010 at 10:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Please explain what will happen when the near server is unavailable, > > with per-standby settings. Please also explain what will happen if we > > choose to have 4 or 5 servers to maintain performance in case of the > > near server going down. How will we specify the failure modes? > > I'll try to explain that.

We've been discussing parameters and how we would define what we want to happen in various scenarios. You've not explained what parameters you would use, how and where they would be set, so we aren't yet any closer to understanding what it is you're proposing.

Please explain how your proposal will work. -- Simon Riggs www.2ndQuadrant.com
On Thu, 2010-05-27 at 16:13 +0900, Fujii Masao wrote: > On Wed, May 26, 2010 at 10:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > If the remote server responded first, then that proves it is a better > > candidate for failover than the one you think of as near. If the two > > standbys vary over time then you have network problems that will > > directly affect the performance on the master; synch_rep = N would > > respond better to any such problems. > > No. The remote standby might respond first temporarily though it's almost > behind the near one. The read-only queries or incrementally updated > backup operation might cause a bursty disk write, and delay the ACK from > the standby. The lock contention between read-only queries and recovery > would delay the ACK. So the standby which responds first is not always > the best candidate for failover.

Seems strange. If you have 2 standbys and you say you would like node1 to be the preferred candidate, but then load it so heavily that a remote server with a by-definition much larger network delay responds first, then I say your preference was wrong. The above situation is caused by the DBA and the DBA can solve it also - if the preference is to keep a "preferred" server then that server would need to be lightly loaded so it could respond sensibly.

This is the same thing as having an optimizer pick the best path and then the user saying "no dumb-ass, use the index I tell you" even though it is slower. If you really don't want to know the fastest way, then I personally will agree you can have that, as is my view (now) on the optimizer issue also - sometimes the admin does know best.

> Also the administrator generally doesn't > put the remote standby under the control of a clusterware like heartbeat. > In this case, the remote standby will never be the candidate for failover. > But quorum commit cannot cover this simple case.

If you, Jan and Yeb wish to completely exclude standbys from being part of any quorum, then I guess we need to have per-standby settings to allow that to be defined. I'm in favour of giving people options. That needn't be a mandatory per-standby setting, just a non-default option, so that we can reduce the complexity of configuration for common cases. If we're looking for simplest-implementation-first, that isn't it.

Currently, Oracle provides these settings, which correspond to:

  Maximum Performance  => quorum = 0
  Maximum Availability => quorum = 1, timeout_action = commit
  Maximum Protection   => quorum = 1, timeout_action = shutdown

So Oracle already supports the quorum case... Oracle doesn't provide

 i) any capability to have quorum > 1
 ii) any capability to include an async node as a sync node, if the quorum cannot be reached with servers marked "sync", or in the situation where because of mis-use/mis-configuration the sync servers are actually slower
 iii) ability to wait for apply
 iv) ability to specify wait mode at transaction level

All of those are desirable in some cases and easily possible by specifying things in the way I've suggested. -- Simon Riggs www.2ndQuadrant.com
On Thu, May 27, 2010 at 6:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Who sends the ack message?

walreceiver

> Who receives it?

walsender

> Would it be easier to have > this happen in a second pair of processes WALSynchroniser (on primary) > and WAL Acknowledger (on standby). WALAcknowledger would send back a > stream of ack messages with latest xlog positions. WALSynchroniser would > receive these messages and wake up sleeping backends. If we did that > then there'd be almost no change at all to existing code, just > additional code and processes for the sync case. Code would be separate > and there would be no performance concerns either.

No, this seems to be a bad idea. We should not establish an extra connection between servers. That would be a source of trouble.

> If you do choose to make #3 important, then I'd say you need to work out > how to make WALWriter active as well, so it can perform regular fsyncs, > rather than having WALReceiver wait across that I/O.

Yeah, this might be an option for optimization, though I'm not sure how much benefit it would have.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2010-05-27 at 19:21 +0900, Fujii Masao wrote: > On Thu, May 27, 2010 at 6:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Would it be easier to have > > this happen in a second pair of processes WALSynchroniser (on primary) > > and WAL Acknowledger (on standby). WALAcknowledger would send back a > > stream of ack messages with latest xlog positions. WALSynchroniser would > > receive these messages and wake up sleeping backends. If we did that > > then there'd be almost no change at all to existing code, just > > additional code and processes for the sync case. Code would be separate > > and there would be no performance concerns either. > > No, this seems to be bad idea. We should not establish extra connection > between servers. That would be a source of trouble. What kind of trouble? You think using an extra connection would cause problems; why? I've explained it would greatly simplify the code to do it that way and improve performance. Those sound like good things, not problems. > > If you do choose to make #3 important, then I'd say you need to work out > > how to make WALWriter active as well, so it can perform regular fsyncs, > > rather than having WALReceiver wait across that I/O. > > Yeah, this might be an option for optimization though I'm not sure how > it has good effect. As I said, WALreceiver would not need to wait across fsync... -- Simon Riggs www.2ndQuadrant.com
On Thu, May 27, 2010 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Seems strange. If you have 2 standbys and you say you would like node1 > to be the preferred candidate, then you load it so heavily that a remote > server with by-definition much larger network delay responds first, then > I say your preference was wrong. The above situation is caused by the > DBA and the DBA can solve it also - if the preference is to keep a > "preferred" server then that server would need to be lightly loaded so > it could respond sensibly.

No. Even if the load is very low on the "preferred" server, there is *no* guarantee that it responds first. A per-standby setting can give such a guarantee, i.e., we can specify #2, #3 or #4 in the "preferred" server and #1 in the other.

> This is the same thing as having an optimizer pick the best path and > then the user saying "no dumb-ass, use the index I tell you" even though > it is slower. If you really don't want to know the fastest way, then I > personally will agree you can have that, as is my view (now) on the > optimizer issue also - sometimes the admin does know best.

I think that choosing the wrong master causes a more serious situation than choosing a wrong plan.

>> Also the administrator generally doesn't >> put the remote standby under the control of a clusterware like heartbeat. >> In this case, the remote standby will never be the candidate for failover. >> But quorum commit cannot cover this simple case. > > If you, Jan and Yeb wish to completely exclude standbys from being part > of any quorum, then I guess we need to have per-standby settings to > allow that to be defined. I'm in favour of giving people options. That > needn't be a mandatory per-standby setting, just a non-default option, > so that we can reduce the complexity of configuration for common cases. > If we're looking for simplest-implementation-first that isn't it.

For now, I agree that we should support a quorum commit feature in 9.1 or later. But I don't think that it's simpler, more intuitive and easier to understand than a per-standby setting. So I think that we should include the per-standby setting in the first patch.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, May 27, 2010 at 7:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, 2010-05-27 at 19:21 +0900, Fujii Masao wrote: >> On Thu, May 27, 2010 at 6:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> > Would it be easier to have >> > this happen in a second pair of processes WALSynchroniser (on primary) >> > and WAL Acknowledger (on standby). WALAcknowledger would send back a >> > stream of ack messages with latest xlog positions. WALSynchroniser would >> > receive these messages and wake up sleeping backends. If we did that >> > then there'd be almost no change at all to existing code, just >> > additional code and processes for the sync case. Code would be separate >> > and there would be no performance concerns either. >> >> No, this seems to be bad idea. We should not establish extra connection >> between servers. That would be a source of trouble. > > What kind of trouble? You think using an extra connection would cause > problems; why?

Because the number of connection failure cases doubles. Likewise, the number of process failure cases would double.

>> > If you do choose to make #3 important, then I'd say you need to work out >> > how to make WALWriter active as well, so it can perform regular fsyncs, >> > rather than having WALReceiver wait across that I/O. >> >> Yeah, this might be an option for optimization though I'm not sure how >> it has good effect. > > As I said, WALreceiver would not need to wait across fsync...

Right, but walreceiver still needs to wait for the WAL flush by walwriter. If WAL flush is currently the dominant workload for walreceiver, merely handing it off to walwriter might not have much effect. I'm not sure either way.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
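To spell out the handoff being debated: if walwriter owns the fsync, walreceiver only reports whatever flush pointer walwriter has already reached instead of stalling on the I/O itself. A sketch under invented names:

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    /* shared between walreceiver and walwriter on the standby (sketch) */
    typedef struct StandbyWalState
    {
        XLogRecPtr  written;    /* advanced by walreceiver after write() */
        XLogRecPtr  flushed;    /* advanced by walwriter after fsync()   */
    } StandbyWalState;

    extern void write_wal(const char *buf, int len);    /* write, no fsync */
    extern void send_ack(XLogRecPtr received, XLogRecPtr flushed);

    /* one walreceiver iteration (sketch) */
    static void
    walreceiver_step(StandbyWalState *st, const char *buf, int len,
                     XLogRecPtr end_lsn)
    {
        write_wal(buf, len);
        st->written = end_lsn;

        /*
         * The #2 ("received") position is current immediately; the #3
         * ("flushed") position is whatever walwriter has reached, so
         * walreceiver itself never blocks on an fsync.
         */
        send_ack(st->written, st->flushed);
    }

Fujii's caveat still applies: if the fsync dominates the workload, the master's #3 wait simply becomes a wait for walwriter's pointer to advance.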
Simon Riggs <simon@2ndQuadrant.com> writes: > Seems strange. If you have 2 standbys and you say you would like node1 > to be the preferred candidate, then you load it so heavily that a remote > server with by-definition much larger network delay responds first, then > I say your preference was wrong.

There's a communication mismatch here I think. The problem with the dynamic aspect of the system is that the admin wants to plan ahead and choose in advance the failover server. Other than that I much prefer the automatic and dynamic quorum idea.

> If you, Jan and Yeb wish to completely exclude standbys from being part > of any quorum, then I guess we need to have per-standby settings to > allow that to be defined. I'm in favour of giving people options. That > needn't be a mandatory per-standby setting, just a non-default option, > so that we can reduce the complexity of configuration for common > cases.

+1

> Maximum Performance => quorum = 0 > Maximum Availability => quorum = 1, timeout_action = commit > Maximum Protection => quorum = 1, timeout_action = shutdown

+1

Being able to say that a given server has not been granted the right to participate in the vote allowing us to reach the global durability quorum will allow choosing the failover candidates. Now you're able to have this reporting server and know for sure that your sync replicated transactions are not waiting for it.

To summarize, the current "per-transaction approach" would be:

- transaction level replication synchronous behaviour
- proxy/cascading in core
- quorum setup for deciding any commit is safe
- any server can be excluded from the sync quorum
- timeout can still raise an exception or ignore (commit)?

This last point seems to need some more discussion, or I didn't understand the current positions and proposals well.

Regards, -- dim
On Thu, May 27, 2010 at 3:13 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > (1) most standard case: 1 master + 1 "sync" standby (near) > When the master goes down, something like a clusterware detects that > failure, and brings the standby online. Since we can ensure that the > standby has all the committed transactions, failover doesn't cause > any data loss. How do you propose to guarantee that? ISTM that you have to either commit locally first, or send the commit to the remote first. Either way, the two events won't occur exactly simultaneously. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Thu, 2010-05-27 at 20:13 +0900, Fujii Masao wrote: > On Thu, May 27, 2010 at 7:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Thu, 2010-05-27 at 19:21 +0900, Fujii Masao wrote: > >> On Thu, May 27, 2010 at 6:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > >> > Would it be easier to have > >> > this happen in a second pair of processes WALSynchroniser (on primary) > >> > and WAL Acknowledger (on standby). WALAcknowledger would send back a > >> > stream of ack messages with latest xlog positions. WALSynchroniser would > >> > receive these messages and wake up sleeping backends. If we did that > >> > then there'd be almost no change at all to existing code, just > >> > additional code and processes for the sync case. Code would be separate > >> > and there would be no performance concerns either. > >> > >> No, this seems to be bad idea. We should not establish extra connection > >> between servers. That would be a source of trouble. > > > > What kind of trouble? You think using an extra connection would cause > > problems; why? > > Because the number of connection failure cases doubles. Likewise, the number > of process failure cases would double.

Not really. The users wait just for the synchroniser to return, not for two things. It looks to me like the other processes are independent of each other. Very simple.

> >> > If you do choose to make #3 important, then I'd say you need to work out > >> > how to make WALWriter active as well, so it can perform regular fsyncs, > >> > rather than having WALReceiver wait across that I/O. > >> > >> Yeah, this might be an option for optimization though I'm not sure how > >> it has good effect. > > > > As I said, WALreceiver would not need to wait across fsync... > > Right, but walreceiver still needs to wait for WAL flush by walwriter.

Why does it? I just explained a design where that wasn't required.

> If currently WAL flush is the dominant workload for walreceiver, > only leaving it to walwriter might not have so good effect. I'm not sure > whether.

If we're not sure, we could check before agreeing on a design. WAL flush will be costly unless you have a huge disk cache. -- Simon Riggs www.2ndQuadrant.com
On Thu, 2010-05-27 at 19:50 +0900, Fujii Masao wrote: > For now, I agree that we support a quorum commit feature for 9.1 or later. > But I don't think that it's simpler, more intuitive and easier-to-understand > than per-standby setting. So I think that we should include the per-standby > setting in the first patch. There already is a first patch to the community that implements quorum commit, just not by you. If you have a better way, describe it in detail and in full now, with reference to each of the use cases you mentioned, so that people get a chance to give their opinions on your design. Then we can let the community decide whether or not that second way is actually better. We may not need a second patch. -- Simon Riggs www.2ndQuadrant.com
On Thu, May 27, 2010 at 8:28 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, May 27, 2010 at 3:13 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> (1) most standard case: 1 master + 1 "sync" standby (near) >> When the master goes down, something like a clusterware detects that >> failure, and brings the standby online. Since we can ensure that the >> standby has all the committed transactions, failover doesn't cause >> any data loss. > > How do you propose to guarantee that? ISTM that you have to either > commit locally first, or send the commit to the remote first. Either > way, the two events won't occur exactly simultaneously.

Letting the transaction wait until the standby has received / flushed / replayed the WAL before it returns a "success" indicator to a client would guarantee that. This ensures that all transactions which a client knows to be committed exist in memory or on disk on the standby. So we would be able to see those transactions on the new master after failover.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
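The guarantee comes purely from the ordering: flush the commit record locally, then block until the standby's ACK covers that LSN, and only then report success. A sketch, where every function below is a hypothetical placeholder marking where the real calls would go:

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    extern XLogRecPtr write_commit_record(void);    /* XLogInsert() of commit */
    extern void flush_local_wal(XLogRecPtr upto);   /* XLogFlush()            */
    extern void wait_for_standby_ack(XLogRecPtr lsn, int level);
    extern void report_success_to_client(void);

    static void
    commit_sketch(int requested_level)
    {
        XLogRecPtr  commit_lsn = write_commit_record();

        flush_local_wal(commit_lsn);    /* local durability first */

        /*
         * Block until the standby reports the commit record received,
         * flushed, or replayed (#2/#3/#4).  If the ACK never comes, the
         * client never sees "success", so a failover cannot lose a
         * transaction the client believes committed.
         */
        wait_for_standby_ack(commit_lsn, requested_level);

        report_success_to_client();
    }

Note the converse remains possible: a commit durable on both servers whose "success" message never reached the client, which is the case discussed next.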
On Thu, May 27, 2010 at 8:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Why does it? I just explained a design where that wasn't required.

Hmm... my wording might have been ambiguous. Walreceiver still needs to wait for the WAL flush by walwriter *before* sending the ACK to the master in the #3 case, because, in #3, the master has to wait until the standby has flushed the WAL.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, May 27, 2010 at 8:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > There already is a first patch to the community that implements quorum > commit, just not by you.

Yeah, AFAIK, that patch also includes a per-standby setting.

> If you have a better way, describe it in detail and in full now, with > reference to each of the use cases you mentioned, so that people get a > chance to give their opinions on your design. Then we can let the > community decide whether or not that second way is actually better. We > may not need a second patch.

See http://archives.postgresql.org/pgsql-hackers/2010-05/msg01407.php

But I think that we should focus on the "per-standby" setting first.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, May 27, 2010 at 8:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, May 27, 2010 at 8:28 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, May 27, 2010 at 3:13 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> (1) most standard case: 1 master + 1 "sync" standby (near) >>> When the master goes down, something like a clusterware detects that >>> failure, and brings the standby online. Since we can ensure that the >>> standby has all the committed transactions, failover doesn't cause >>> any data loss. >> >> How do you propose to guarantee that? ISTM that you have to either >> commit locally first, or send the commit to the remote first. Either >> way, the two events won't occur exactly simultaneously. > > Letting the transaction wait until the standby has received / flushed / > replayed the WAL before it returns a "success" indicator to a client > would guarantee that. This ensures that all transactions which a client > knows as committed exist in the memory or disk of the standby. So we > would be able to see those transactions from new master after failover. There could still be additional transactions that the original master has committed locally but were not acked to the client. I guess you'd just work around that by taking a new base backup from the new master. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Thu, May 27, 2010 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote: > There could still be additional transactions that the original master > has committed locally but were not acked to the client. I guess you'd > just work around that by taking a new base backup from the new master.

Right.

Unfortunately, a transaction reported as aborted to a client might have already been committed on the standby. In this case, we might need to eliminate the mismatch of transaction status between the client and the new master after failover.

BTW, a similar situation might happen even when only one server is running. If the server goes down before returning a "success" to a client after flushing the commit record, the mismatch would happen after the server restarts.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, May 27, 2010 at 9:09 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, May 27, 2010 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> There could still be additional transactions that the original master >> has committed locally but were not acked to the client. I guess you'd >> just work around that by taking a new base backup from the new master. > > Right. > > Unfortunately the transaction aborted for a client might have already > been committed in the standby. In this case, we might need to eliminate > the mismatch of transaction status between a client and new master > after failover. > > BTW, the similar situation might happen even when only one server is > running. If the server goes down before returning a "success" to a > client after flushing the commit record, the mismatch would happen > after restart of the server. True. But that's a slightly different case. Clients could fail to receive commit ACKs for a variety of reasons, like losing network connectivity momentarily. They had better be prepared for that no matter whether replication is in use or not. The new issue that replication adds is that you've got to make sure that the two (or n) nodes don't disagree with each other. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Simon Riggs wrote: > On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote: > > > I guess that dropping the support of #3 doesn't reduce complexity > > since the code of #3 is almost the same as that of #2. Like > > walreceiver sends the ACK after receiving the WAL in #2 case, it has > > only to do the same thing after the WAL flush. > > Hmm, well the code for #3 is similar also to the code for #4. So if you > do #2, its easy to do #2, #3 and #4 together. > > The comment is about whether having #3 makes sense from a user interface > perspective. It's easy to add options, but they must have useful > meaning.

If the slave is running read-only queries, #3 is the most reliable option without delaying the slave, so there is a use case for #3. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
Heikki Linnakangas wrote: > BTW, I think we're going to need a separate config file for listing the > standbys anyway. There you can write per-server rules and options, but > explicitly knowing about all the standbys also allows the master to > recycle WAL as soon as it has been streamed to all the registered > standbys. Currently we just keep wal_keep_segments files around, just in > case there's a standby out there that needs them. Ideally we could set 'slave_sync_count' and 'slave_commit_continue_mode' on the master, and allow the sync/async mode to be set on each slave, e.g. if slave_sync_count = 2 and slave_commit_continue_mode = #2, then two slaves with sync mode of #2 or stricter have to complete before the master can continue. Naming the slaves on the master seems very confusing because I am unclear how we would identify named slaves, and the names have to match, etc. Also, what would be cool would be if you could run a query on the master to view the SR commit mode of each slave. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
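Using Bruce's proposed names, the settings might read as below. The file placement and the per-slave parameter name are guesses, included only to show the shape of the configuration:

    # postgresql.conf on the master
    slave_sync_count = 2            # two slaves must ACK before commit returns
    slave_commit_continue_mode = 2  # they must ACK at mode #2 or stricter

    # recovery.conf on each slave (hypothetical parameter name)
    sync_mode = 2                   # this slave ACKs once WAL is received

This shape avoids naming slaves on the master entirely: the master only counts ACKs of sufficient strictness, whichever slaves they come from.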
On May 27, 2010, at 4:31 PM, Bruce Momjian <bruce@momjian.us> wrote: > Heikki Linnakangas wrote: >> BTW, I think we're going to need a separate config file for listing >> the >> standbys anyway. There you can write per-server rules and options, >> but >> explicitly knowing about all the standbys also allows the master to >> recycle WAL as soon as it has been streamed to all the registered >> standbys. Currently we just keep wal_keep_segments files around, >> just in >> case there's a standby out there that needs them. > > Ideally we could set 'slave_sync_count' and > 'slave_commit_continue_mode' > on the master, and allow the sync/async mode to be set on each slave, > e.g. if slave_sync_count = 2 and slave_commit_continue_mode = #2, then > two slaves with sync mode of #2 or stricter have to complete before > the > master can continue. > > Naming the slaves on the master seems very confusing because I am > unclear how we would identify named slaves, and the names have to > match, > etc. The names could be configured with a GUC on the slaves, or we could base it on the login role. ...Robert
Heikki Linnakangas wrote: > The possibilities are endless... Your proposal above covers a pretty > good set of scenarios, but it's by no means complete. If we try to > solve everything the configuration will need to be written in a > Turing-complete Replication Description Language. We'll have to pick a > useful, easy-to-understand subset that covers the common scenarios. To > handle the more exotic scenarios, you can write a proxy that sits in > front of the master, and implements whatever rules you wish, with the > rules written in C.

I was thinking about this a bit recently. As I see it, there are three fundamental parts of this:

1) We have a transaction that is being committed. The rest of the computations here are all relative to it.

2) There is an (internal?) table that lists the state of each replication target relative to that transaction. It would include the node name, perhaps some metadata ('location' seems the one that's most likely to help with the remote data center issue), and a state code. The codes from http://wiki.postgresql.org/wiki/Streaming_Replication work fine for the last part (which is the only dynamic one--everything else is static data being joined against):

    async = hasn't received yet
    recv  = been received but just in RAM
    fsync = received and synced to disk
    apply = applied to the database

These would need to be enums so they can be ordered from lesser to greater consistency.

So in a 3 node case, the internal state table might look like this after a bit of data had been committed:

    node | location | state
    ------------------------
    a    | local    | fsync
    b    | remote   | recv
    c    | remote   | async

This means that the local node has a fully persistent copy, but the best either remote one has done is receive the data; it's not on disk at all yet at the remote data center. Still working its way through.

3) The decision about whether the data has been committed to enough places to be considered safe by the master is computed by a function that is passed this internal table as something like a SRF, and it returns a boolean. Once that returns true, saying it's satisfied, the transaction closes on the master and continues to percolate out from there. If it's false, we wait for another state change to come in and return to (2).

I would propose that most behaviors someone has expressed as being their desired implementation are possible to implement using this scheme.

- Semi-sync commit: proceed as soon as somebody else has a copy and hope the copies all become consistent: EXISTS WHERE state >= recv
- Don't proceed until there's an fsync'd commit on at least one of the remote nodes: EXISTS WHERE location = 'remote' AND state >= fsync
- Look for a quorum of n commits of fsync quality: CASE WHEN (SELECT COUNT(*) WHERE state >= fsync) > n THEN true ELSE false END

Syntax is obviously rough but I think you can get the drift of what I'm suggesting.

While exposing the local state and running this computation isn't free, in situations where there truly are remote nodes in here being communicated with, the network overhead is going to dwarf that. If there were a fast path for the simplest cases and this complicated one for the rest, I think you could get the fully programmable behavior some people want using simple SQL, rather than having to write a new "Replication Description Language" or something so ambitious. This data about what's been replicated to where looks an awful lot like a set of rows you can operate on using features already in the database to me.
-- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
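The same rules can also be written as the kind of C that a proxy, or a built-in fast path, could run; a sketch only, with the enum ordering mirroring the lesser-to-greater consistency ordering of the state codes above:

    #include <stdbool.h>

    typedef enum { WAL_ASYNC, WAL_RECV, WAL_FSYNC, WAL_APPLY } AckState;

    typedef struct NodeState
    {
        const char *node;
        bool        remote;     /* the 'location' column */
        AckState    state;
    } NodeState;

    /* Semi-sync commit: EXISTS WHERE state >= recv */
    static bool
    rule_semisync(const NodeState *ns, int n)
    {
        for (int i = 0; i < n; i++)
            if (ns[i].state >= WAL_RECV)
                return true;
        return false;
    }

    /* EXISTS WHERE location = 'remote' AND state >= fsync */
    static bool
    rule_remote_fsync(const NodeState *ns, int n)
    {
        for (int i = 0; i < n; i++)
            if (ns[i].remote && ns[i].state >= WAL_FSYNC)
                return true;
        return false;
    }

    /* quorum: COUNT(*) WHERE state >= fsync must reach q */
    static bool
    rule_quorum_fsync(const NodeState *ns, int n, int q)
    {
        int hits = 0;

        for (int i = 0; i < n; i++)
            if (ns[i].state >= WAL_FSYNC)
                hits++;
        return hits >= q;
    }

Each rule is a pure predicate over the state table, re-evaluated whenever a state change arrives, exactly as step (3) describes.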
On 02/06/10 10:22, Greg Smith wrote: > Heikki Linnakangas wrote: >> The possibilities are endless... Your proposal above covers a pretty >> good set of scenarios, but it's by no means complete. If we try to >> solve everything the configuration will need to be written in a >> Turing-complete Replication Description Language. We'll have to pick a >> useful, easy-to-understand subset that covers the common scenarios. To >> handle the more exotic scenarios, you can write a proxy that sits in >> front of the master, and implements whatever rules you wish, with the >> rules written in C. > > I was thinking about this a bit recently. As I see it, there are three > fundamental parts of this: > > 1) We have a transaction that is being committed. The rest of the > computations here are all relative to it.

Agreed.

> So in a 3 node case, the internal state table might look like this after > a bit of data had been committed:
>
>     node | location | state
>     ------------------------
>     a    | local    | fsync
>     b    | remote   | recv
>     c    | remote   | async
>
> This means that the local node has a fully persistent copy, but the best > either remote one has done is received the data, it's not on disk at all > yet at the remote data center. Still working its way through. > > 3) The decision about whether the data has been committed to enough > places to be considered safe by the master is computed by a function > that is passed this internal table as something like a SRF, and it > returns a boolean. Once that returns true, saying it's satisfied, the > transaction closes on the master and continues to percolate out from > there. If it's false, we wait for another state change to come in and > return to (2).

You can't implement "wait for X to ack the commit, but if that doesn't happen in Y seconds, time out and return true anyway" with that.

> While exposing the local state and running this computation isn't free, > in situations where there truly are remote nodes in here being > communicated with the network overhead is going to dwarf that. If there > were a fast path for the simplest cases and this complicated one for the > rest, I think you could get the fully programmable behavior some people > want using simple SQL, rather than having to write a new "Replication > Description Language" or something so ambitious. This data about what's > been replicated to where looks an awful lot like a set of rows you can > operate on using features already in the database to me.

Yeah, if we want to provide full control over when a commit is acknowledged to the client, there's certainly no reason we can't expose that using a hook or something. It's pretty scary to call a user-defined function at that point in the transaction. Even if we document that you must refrain from doing nasty stuff like modifying tables in that function, it's still scary. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
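Heikki's timeout case fits the same scheme if elapsed time is made an input to the rule alongside the state table. Reusing the NodeState sketch above (still purely illustrative):

    /* "wait for a remote fsync ack, but after timeout_ms commit anyway" */
    static bool
    rule_remote_fsync_or_timeout(const NodeState *ns, int n,
                                 long elapsed_ms, long timeout_ms)
    {
        if (elapsed_ms >= timeout_ms)
            return true;    /* give up: act like a single server */
        return rule_remote_fsync(ns, n);
    }

The rule evaluator would then need to wake not only on state changes but also when the deadline expires.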
On Wed, 2010-06-02 at 03:22 -0400, Greg Smith wrote: > Heikki Linnakangas wrote: > > The possibilities are endless... Your proposal above covers a pretty > > good set of scenarios, but it's by no means complete. If we try to > > solve everything the configuration will need to be written in a > > Turing-complete Replication Description Language. We'll have to pick a > > useful, easy-to-understand subset that covers the common scenarios. To > > handle the more exotic scenarios, you can write a proxy that sits in > > front of the master, and implements whatever rules you wish, with the > > rules written in C. > > I was thinking about this a bit recently. As I see it, there are three > fundamental parts of this: > > 1) We have a transaction that is being committed. The rest of the > computations here are all relative to it. > > 2) There is an (internal?) table that lists the state of each > replication target relative to that transaction. It would include the > node name, perhaps some metadata ('location' seems the one that's most > likely to help with the remote data center issue), and a state code. > The codes from http://wiki.postgresql.org/wiki/Streaming_Replication > work fine for the last part (which is the only dynamic one--everything > else is static data being joined against): > > async=hasn't received yet > recv=been received but just in RAM > fsync=received and synced to disk > apply=applied to the database > > These would need to be enums so they can be ordered from lesser to > greater consistency. > > So in a 3 node case, the internal state table might look like this after > a bit of data had been committed: > > node | location | state > ---------------------------------- > a | local | fsync > b | remote | recv > c | remote | async > > This means that the local node has a fully persistent copy, but the best > either remote one has done is received the data, it's not on disk at all > yet at the remote data center. Still working its way through. > > 3) The decision about whether the data has been committed to enough > places to be considered safe by the master is computed by a function > that is passed this internal table as something like a SRF, and it > returns a boolean. Once that returns true, saying it's satisfied, the > transaction closes on the master and continues to percolate out from > there. If it's false, we wait for another state change to come in and > return to (2). > > I would propose that most behaviors someone has expressed as being their > desired implementation is possible to implement using this scheme. > > -Semi-sync commit: proceed as soon somebody else has a copy and hope > the copies all become consistent: EXISTS WHERE state>=recv > -Don't proceed until there's a fsync'd commit on at least one of the > remote nodes: EXISTS WHERE location='remote' AND state>=fsync > -Look for a quorum of n commits of fsync quality: CASE WHEN (SELECT > COUNT(*) WHERE state>=fsync)>n THEN true else FALSE end; > > Syntax is obviously rough but I think you can get the drift of what I'm > suggesting. > > While exposing the local state and running this computation isn't free, > in situations where there truly are remote nodes in here being > communicated with the network overhead is going to dwarf that. If there > were a fast path for the simplest cases and this complicated one for the > rest, I think you could get the fully programmable behavior some people > want using simple SQL, rather than having to write a new "Replication > Description Language" or something so ambitious. 
> This data about what's been replicated to where looks an awful lot like a set of rows you can operate on using features already in the database to me.

I think we're all agreed on the 4 levels: async, recv, fsync, apply.

I also like the concept of a synchronisation/wakeup rule as an abstract concept. Certainly makes things easier to discuss.

The inputs to the wakeup rule can be defined in different ways. Holding per-node state at local level looks too complex to me. I'm not suggesting that we need both per-node AND per-transaction options interacting at the same time. (That would be a clear argument against per-transaction options, if that was a requirement - it's not, for me).

There seems to be a simpler way: a service oriented model. The transaction requests a minimum level of synchronisation, the standbys together service that request. A simple, clear process:

1. If a transaction requests recv, fsync or apply, the backend sleeps in the appropriate queue
2. An agent on behalf of the remote standby provides feedback according to the levels of service defined for that standby
3. The agent calls a wakeup-rule to see if the backend can be woken yet

The most basic rule is "first-standby-wakes", meaning that the first standby to provide feedback that the required synchronisation level has been met by at least one standby will cause the rule to fire. The next most basic thing is that some standbys can be marked as not taking part in the quorum and are not capable of waking people with certain levels of request. Rules can record intermediate data, to allow us to wait until multiple agents have provided feedback. Another rule might be "wait for all standbys marked apply", though that has complex behaviour in failure conditions. Other more complex rules are possible. That can be defined explicitly with some DDL/RDL, or we could use a plugin, or just hardcode some options. That level of complexity is secondary and can be added later. That is especially easy once we have the concept of a synchronisation wakeup rule as Greg describes here.

Being able to apply the synchronisation level at transaction level is very important. All other systems only provide a single level of synchronisation, making application design difficult. It will be a compelling feature for application designers to be able to make their reference data updated at APPLY, so always consistent everywhere, while other important data is FSYNC, and less critical data is ASYNC. It's a real pain to have to partition applications because of their synchronisation requirements. -- Simon Riggs www.2ndQuadrant.com
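A condensed sketch of this service model: each waiting backend queues its commit LSN and requested level, and the agent for a standby releases every waiter its latest feedback now satisfies, which is exactly the "first-standby-wakes" rule. All names are invented; real code would keep the queue in shared memory with proper locking and latches.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    typedef struct Waiter
    {
        XLogRecPtr     wait_lsn;    /* commit record being waited on  */
        int            level;       /* 2 = recv, 3 = fsync, 4 = apply */
        bool           released;
        struct Waiter *next;
    } Waiter;

    typedef struct Feedback         /* one standby's latest positions */
    {
        XLogRecPtr  received, flushed, applied;
    } Feedback;

    static bool
    satisfied(const Feedback *fb, const Waiter *w)
    {
        switch (w->level)
        {
            case 2:  return fb->received >= w->wait_lsn;
            case 3:  return fb->flushed  >= w->wait_lsn;
            case 4:  return fb->applied  >= w->wait_lsn;
            default: return true;
        }
    }

    /* run by the agent for each feedback message it relays */
    static void
    wake_satisfied(Waiter *queue, const Feedback *fb)
    {
        for (Waiter *w = queue; w != NULL; w = w->next)
            if (!w->released && satisfied(fb, w))
                w->released = true;     /* real code: set the backend's latch */
    }

Non-voting standbys simply never call wake_satisfied() for levels they are not allowed to service, and richer rules (quorums, "all standbys marked apply") replace the per-message release test with one that records intermediate counts.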
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > It's pretty scary to call a user-defined function at that point in > transaction. Not so much "pretty scary" as "zero chance of being accepted". And I do mean zero. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >> It's pretty scary to call a user-defined function at that point in >> transaction. >> > > Not so much "pretty scary" as "zero chance of being accepted". > And I do mean zero. >

I swear, you guys are such buzzkills some days. I was suggesting a model for building easy prototypes, and advocating a more formal way to explain, in what could be code form, what someone means when they suggest a particular quorum model or the like. Maybe all that will ever be exposed in a production server are the best of the hand-written implementations, and the scary "try your prototype here" hook only shows up in debug builds, or never gets written at all. I did comment that I expected faster built-in implementations to be the primary way these would be handled.

From what Heikki said, it sounds like the main thing I didn't remember is to include some timestamp information to allow rules based on that information too. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On 5/27/2010 4:31 PM, Bruce Momjian wrote: > Heikki Linnakangas wrote: >> BTW, I think we're going to need a separate config file for listing the >> standbys anyway. There you can write per-server rules and options, but >> explicitly knowing about all the standbys also allows the master to >> recycle WAL as soon as it has been streamed to all the registered >> standbys. Currently we just keep wal_keep_segments files around, just in >> case there's a standby out there that needs them. > > Ideally we could set 'slave_sync_count' and 'slave_commit_continue_mode' > on the master, and allow the sync/async mode to be set on each slave, > e.g. if slave_sync_count = 2 and slave_commit_continue_mode = #2, then > two slaves with sync mode of #2 or stricter have to complete before the > master can continue. > > Naming the slaves on the master seems very confusing because I am > unclear how we would identify named slaves, and the names have to match, > etc. > > Also, what would be cool would be if you could run a query on the master > to view the SR commit mode of each slave. What would be the use case for such a query? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On Thu, Jun 3, 2010 at 8:47 PM, Jan Wieck <JanWieck@yahoo.com> wrote: > On 5/27/2010 4:31 PM, Bruce Momjian wrote: >> >> Heikki Linnakangas wrote: >>> >>> BTW, I think we're going to need a separate config file for listing the >>> standbys anyway. There you can write per-server rules and options, but >>> explicitly knowing about all the standbys also allows the master to recycle >>> WAL as soon as it has been streamed to all the registered standbys. >>> Currently we just keep wal_keep_segments files around, just in case there's >>> a standby out there that needs them. >> >> Ideally we could set 'slave_sync_count' and 'slave_commit_continue_mode' >> on the master, and allow the sync/async mode to be set on each slave, >> e.g. if slave_sync_count = 2 and slave_commit_continue_mode = #2, then >> two slaves with sync mode of #2 or stricter have to complete before the >> master can continue. >> >> Naming the slaves on the master seems very confusing because I am >> unclear how we would identify named slaves, and the names have to match, >> etc. >> Also, what would be cool would be if you could run a query on the master >> to view the SR commit mode of each slave. > > What would be the use case for such a query? Monitoring? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Thu, Jun 03, 2010 at 10:57:05PM -0400, Robert Haas wrote: > On Thu, Jun 3, 2010 at 8:47 PM, Jan Wieck <JanWieck@yahoo.com> wrote: > > What would be the use case for such a query? > > Monitoring? s/\?/!/; Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 6/3/2010 10:57 PM, Robert Haas wrote: > On Thu, Jun 3, 2010 at 8:47 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >> On 5/27/2010 4:31 PM, Bruce Momjian wrote: >>> Also, what would be cool would be if you could run a query on the master >>> to view the SR commit mode of each slave. >> >> What would be the use case for such a query? > > Monitoring? So that justifies adding code, that the community needs to maintain and document, to the core system. If only I could find some monitoring case for transaction commit orders ... sigh! Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On Fri, Jun 4, 2010 at 3:35 PM, Jan Wieck <JanWieck@yahoo.com> wrote: > On 6/3/2010 10:57 PM, Robert Haas wrote: >> >> On Thu, Jun 3, 2010 at 8:47 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >>> >>> On 5/27/2010 4:31 PM, Bruce Momjian wrote: >>>> >>>> Also, what would be cool would be if you could run a query on the master >>>> to view the SR commit mode of each slave. >>> >>> What would be the use case for such a query? >> >> Monitoring? > > So that justifies adding code, that the community needs to maintain and > document, to the core system. If only I could find some monitoring case for > transaction commit orders ... sigh! Dude, I'm not the one arguing with you... actually I don't think anyone really is, any more, except about details. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 6/4/2010 4:22 PM, Robert Haas wrote: > On Fri, Jun 4, 2010 at 3:35 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >> So that justifies adding code, that the community needs to maintain and >> document, to the core system. If only I could find some monitoring case for >> transaction commit orders ... sigh! > > Dude, I'm not the one arguing with you... actually I don't think > anyone really is, any more, except about details. I know. You actually pretty much defend my case. Sorry for lacking smileys. This is an old habit I have. A good friend from Germany once suspected one of my emails to be a spoof because I actually used a smiley. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Hi, Dimitri Fontaine wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > >> Seems strange. If you have 2 standbys and you say you would like node1 >> to be the preferred candidate, then you load it so heavily that a remote >> server with by-definition much larger network delay responds first, then >> I say your preference was wrong. >> > > There's a communication mismatch here I think. The problem with the > dynamic aspect of the system is that the admin wants to plan ahead and > choose in advance the failover server. > > Other than that I much prefer the automatic and dynamic quorum idea. > > >> If you, Jan and Yeb wish to completely exclude standbys from being part >> of any quorum, then I guess we need to have per-standby settings to >> allow that to be defined. I'm in favour of giving people options. That >> needn't be a mandatory per-standby setting, just a non-default option, >> so that we can reduce the complexity of configuration for common >> cases. >> > > +1 > > >> Maximum Performance => quorum = 0 >> Maximum Availability => quorum = 1, timeout_action = commit >> Maximum Protection => quorum = 1, timeout_action = shutdown >> > > +1 > > Being able to say that a given server has not been granted to > participate into the vote allowing to reach the global durability quorum > will allow for choosing the failover candidates. > > Now you're able to have this reporting server and know for sure that > your sync replicated transactions are not waiting for it. > > To summarize, the current "per-transaction approach" would be : > > - transaction level replication synchronous behaviour > Sorry for answering such an old mail, but what is the purpose of a transaction-level synchronous behaviour if async transactions can be held back by a sync transaction? In my patch, when the transactions were waiting for ack from the standby, they had already released all their locks; the wait happened at the latest possible point in CommitTransaction(). In Fujii's patch (I am looking at synch_rep_0722.patch, is there a newer one?) the wait happens in RecordTransactionCommit(), so other transactions still see the sync transaction and, most importantly, the locks held by the sync transaction will make the async transactions waiting for the same lock wait too. > - proxy/cascading in core > - quorum setup for deciding any commit is safe > - any server can be excluded from the sync quorum > - timeout can still raises exception or ignore (commit)? > > This last point seems to need some more discussion, or I didn't > understand well the current positions and proposals. > > Regards, > Best regards, Zoltán Böszörményi
On Fri, Sep 3, 2010 at 6:43 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote: > In my patch, when the transactions were waiting for ack from > the standby, they had already released all their locks; the wait > happened at the latest possible point in CommitTransaction(). > > In Fujii's patch (I am looking at synch_rep_0722.patch, is there > a newer one?) No ;) We'll have to create the patch based on the result of the recent discussion held on another thread. > the wait happens in RecordTransactionCommit() > so other transactions still see the sync transaction and most > importantly, the locks held by the sync transaction will make > the async transactions waiting for the same lock wait too. The transaction should be invisible to other transactions until its replication has been completed. So I put the wait before CommitTransaction() calls ProcArrayEndTransaction(). Is this unsafe? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
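To make the placement Fujii describes concrete, here is a minimal C sketch of the commit ordering under discussion. It is not taken from any posted patch; WriteCommitRecord, SyncRepWaitForLSN, ReleaseLocks and their stub bodies are invented stand-ins for the real backend routines. The point is only the ordering: the standby's ack is awaited before the backend leaves the proc array, so the transaction stays invisible while it waits.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;        /* WAL position */

    static XLogRecPtr XactLastRecEnd;   /* end of our commit record */

    /* Invented stand-ins for the real backend routines. */
    static XLogRecPtr WriteCommitRecord(void) { return ++XactLastRecEnd; }
    static void XLogFlush(XLogRecPtr upto) { (void) upto; }
    static void SyncRepWaitForLSN(XLogRecPtr upto) { (void) upto; }
    static void ProcArrayEndTransaction(void) { }
    static void ReleaseLocks(void) { }

    static void
    CommitSequenceSketch(bool sync_replication)
    {
        XLogRecPtr  recptr = WriteCommitRecord();   /* XLogInsert() */

        XLogFlush(recptr);                          /* local durability */

        if (sync_replication)
            SyncRepWaitForLSN(recptr);              /* wait for the standby's
                                                     * ack BEFORE becoming
                                                     * visible */

        ProcArrayEndTransaction();                  /* others may now see us */
        ReleaseLocks();
    }

Zoltan's variant, discussed below, would move the SyncRepWaitForLSN() step after the last two calls.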
Fujii Masao wrote: > On Fri, Sep 3, 2010 at 6:43 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote: > >> In my patch, when the transactions were waiting for ack from >> the standby, they had already released all their locks; the wait >> happened at the latest possible point in CommitTransaction(). >> >> In Fujii's patch (I am looking at synch_rep_0722.patch, is there >> a newer one?) >> > > No ;) > > We'll have to create the patch based on the result of the recent > discussion held on another thread. > > >> the wait happens in RecordTransactionCommit() >> so other transactions still see the sync transaction and most >> importantly, the locks held by the sync transaction will make >> the async transactions waiting for the same lock wait too. >> > > The transaction should be invisible to other transactions until > its replication has been completed. Invisible? How can it be invisible? You are in RecordTransactionCommit(), even before calling ProcArrayEndTransaction(MyProc, latestXid) and releasing the locks the transaction holds. > So I put the wait before > CommitTransaction() calls ProcArrayEndTransaction(). Is this unsafe? > I don't know whether it's unsafe. In my patch, I only registered the Xid at the point where you do WaitXLogSend(); that was the safe point to set up the waiting for sync ack. Otherwise, when the Xid registration for the sync ack was done in CommitTransaction() later than RecordTransactionCommit(), there was a race between the primary and the standby. The scenario was that the standby received and processed the COMMIT of certain Xids even before the backend on the primary properly registered its Xid, so the backend set up the waiting for sync ack after this Xid had already been acked by the standby. The result was stuck backends. My idea to split up the registration for the wait and the waiting itself would allow for a transaction-level synchronous setting, i.e. in my patch the transaction released the locks and did all the post-commit cleanups, *then* it waited for sync ack if needed. In the meantime, because locks were already released, other transactions could progress with their jobs, allowing e.g. async transactions to progress and theoretically finish faster than the sync transaction that was waiting for the ack. The solution in my patch was not racy; registration of the Xid was done before XLogInsert() in RecordTransactionCommit(). If the standby acked the Xid to the primary before reaching the end of CommitTransaction() then this backend didn't even need to wait, because the Xid was found in its PGPROC structure and the waiting for sync ack was torn down. But with the LSNs, you are waiting for XactLastRecEnd, which is set by XLogInsert(), and I don't know if it's safe to WaitXLogSend() after XLogFlush() in RecordTransactionCommit(). I remember that in previous instances of my patch, even if I put the waiting for sync ack directly after latestXid = RecordTransactionCommit(); in CommitTransaction(), there were cases when I got stuck backends after a pgbench run. I had the primary and standbys on the same machine on different ports, so the ack was almost instant, which wouldn't be the case with a real network. But the race condition was still there; it just doesn't show up with networks being slower than memory. In your patch, the waiting happens almost at the end of RecordTransactionCommit(), so theoretically it has the same race condition. Am I missing something? Best regards, Zoltán Böszörményi > Regards, > >
Boszormenyi Zoltan <zb@cybertec.at> writes: > Sorry for answering such an old mail, but what is the purpose of > a transaction level synchronous behaviour if async transactions > can be held back by a sync transaction? I don't understand why it would be the case (sync holding back async transactions) — it's been proposed that walsender could periodically feed back to the master the current WAL position received, synced and applied. So you can register your sync transaction to wait (and block) until walsender sees a synced WAL position after your own (including it) and another transaction can wait until walsender sees a received WAL position after its own, for example. Of course, meanwhile, any async transaction would just commit without caring about slaves. Not implementing it nor thinking about how to implement it, it seems simple enough :) Regards, -- dim
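Dimitri's scheme can be sketched in C along these lines; everything here is hypothetical (none of these variables or the enum exist in any patch), but it shows how the levels from the start of the thread would map onto positions the walsender learns from periodic standby feedback.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    typedef enum
    {
        WAIT_RECEIVE,           /* level #2: WAL received by the standby */
        WAIT_FLUSH,             /* level #3: WAL flushed on the standby */
        WAIT_APPLY              /* level #4: WAL replayed on the standby */
    } SyncWaitLevel;

    /* Positions the walsender would update from periodic standby feedback. */
    static XLogRecPtr standby_received;
    static XLogRecPtr standby_flushed;
    static XLogRecPtr standby_applied;

    static bool
    CommitIsAcked(XLogRecPtr commit_lsn, SyncWaitLevel level)
    {
        switch (level)
        {
            case WAIT_RECEIVE:
                return standby_received >= commit_lsn;
            case WAIT_FLUSH:
                return standby_flushed >= commit_lsn;
            case WAIT_APPLY:
                return standby_applied >= commit_lsn;
        }
        return false;           /* not reached */
    }

A sync transaction would sleep until CommitIsAcked() is true for its own commit LSN, while an async transaction never consults it at all.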
Dimitri Fontaine wrote: > Boszormenyi Zoltan <zb@cybertec.at> writes: > >> Sorry for answering such an old mail, but what is the purpose of >> a transaction-level synchronous behaviour if async transactions >> can be held back by a sync transaction? >> > > I don't understand why it would be the case (sync holding back async > transactions) — it's been proposed that walsender could periodically > feed back to the master the current WAL position received, synced and > applied. > > So you can register your sync transaction to wait (and block) until > walsender sees a synced WAL position after your own (including it) and > another transaction can wait until walsender sees a received WAL > position after its own, for example. Of course, meanwhile, any async > transaction would just commit without caring about slaves. > The locks held by a transaction are released after RecordTransactionCommit(), and waiting for the sync ack happens in this function. Now what happens when a sync transaction holds a lock that an async one is waiting for? > Not implementing it nor thinking about how to implement it, it seems > simple enough :) > > Regards, > -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
On Mon, 2010-09-06 at 21:45 +0200, Boszormenyi Zoltan wrote: > Dimitri Fontaine wrote: > > Boszormenyi Zoltan <zb@cybertec.at> writes: > > > >> Sorry for answering such an old mail, but what is the purpose of > >> a transaction-level synchronous behaviour if async transactions > >> can be held back by a sync transaction? > >> > > > > I don't understand why it would be the case (sync holding back async > > transactions) — it's been proposed that walsender could periodically > > feed back to the master the current WAL position received, synced and > > applied. > > > > So you can register your sync transaction to wait (and block) until > > walsender sees a synced WAL position after your own (including it) and > > another transaction can wait until walsender sees a received WAL > > position after its own, for example. Of course, meanwhile, any async > > transaction would just commit without caring about slaves. > > > > The locks held by a transaction are released after > RecordTransactionCommit(), and waiting for the sync ack > happens in this function. Now what happens when a sync > transaction holds a lock that an async one is waiting for? It seems your glass is half-empty. Mine is half-full. My perspective would be that if there is contention between async and sync transactions then we will get better throughput than if all transactions were sync. Though perhaps the main issue in that case would be application lock contention, not the speed of synchronous replication. The highest-level issue is that the system only has so many physical resources. If we are unable to focus our resources onto the things that matter most then we end up wasting resources. Mixing async and sync transactions on the same server allows a single application to carefully balance performance and durability. Exactly as we do with synchronous_commit. By now, people are beginning to see that synchronous replication is important but has poor performance. Fine-grained control is essential to using it effectively in areas that matter most. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On Mon, 2010-09-06 at 21:45 +0200, Boszormenyi Zoltan wrote: >> The locks held by a transaction are released after >> RecordTransactionCommit(), and waiting for the sync ack >> happens in this function. Now what happens when a sync >> transaction holds a lock that an async one is waiting for? > It seems your glass is half-empty. Mine is half-full. Simon, you really are failing to advance the conversation. You claim that we can have sync plus async transactions without a performance hit, but you have failed to explain how, at least in any way that anyone else understands. Pontificating about how that will be so much better than not having it doesn't address the problem that others are having with seeing how to implement it. regards, tom lane
Simon Riggs wrote: > On Mon, 2010-09-06 at 21:45 +0200, Boszormenyi Zoltan wrote: > >> Dimitri Fontaine wrote: >> >>> Boszormenyi Zoltan <zb@cybertec.at> writes: >>> >>> >>>> Sorry for answering such an old mail, but what is the purpose of >>>> a transaction-level synchronous behaviour if async transactions >>>> can be held back by a sync transaction? >>>> >>>> >>> I don't understand why it would be the case (sync holding back async >>> transactions) — it's been proposed that walsender could periodically >>> feed back to the master the current WAL position received, synced and >>> applied. >>> >>> So you can register your sync transaction to wait (and block) until >>> walsender sees a synced WAL position after your own (including it) and >>> another transaction can wait until walsender sees a received WAL >>> position after its own, for example. Of course, meanwhile, any async >>> transaction would just commit without caring about slaves. >>> >>> >> The locks held by a transaction are released after >> RecordTransactionCommit(), and waiting for the sync ack >> happens in this function. Now what happens when a sync >> transaction holds a lock that an async one is waiting for? >> > > It seems your glass is half-empty. Mine is half-full. This is good, we can meet halfway. :-) > My perspective > would be that if there is contention between async and sync transactions > then we will get better throughput than if all transactions were sync. > Though perhaps the main issue in that case would be application lock > contention, not the speed of synchronous replication. > The difference we are talking about is:

    xact1                  xact2
    -----                  -----
    begin                  begin
    lock something         lock same
    (in commit)
    write wal record
    wait for sync ack
    release locks/etc      <- xact2 can proceed from here

vs.

    xact1                  xact2
    -----                  -----
    begin                  begin
    lock something         lock same
    (in commit)
    write wal record
    release locks/etc      <- xact2 can proceed from here
    wait for sync ack

In the first case, the contention is obviously increased. With this, we are creating more idle time in the server instead of letting other transactions do their jobs as soon as possible. The second method was implemented in my patch. Are there any drawbacks with this? > The highest-level issue is that the system only has so many physical > resources. If we are unable to focus our resources onto the things that > matter most then we end up wasting resources. Mixing async and sync > transactions on the same server allows a single application to carefully > balance performance and durability. Exactly as we do with > synchronous_commit. > I don't think this is the same situation. With synchronous_commit, you have an auxiliary process that's handed the job of doing the syncing. But there's nowhere to hand off the waiting for sync ack from the standby. > By now, people are beginning to see that synchronous replication is > important but has poor performance. Fine-grained control is essential to > using it effectively in areas that matter most. > -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
On Mon, 2010-09-06 at 22:32 +0200, Boszormenyi Zoltan wrote:
> (in commit)
> write wal record
> release locks/etc      <- xact2 can proceed from here
> wait for sync ack
>
> In the first case, the contention is obviously increased. With this, we are creating more idle time in the server instead of letting other transactions do their jobs as soon as possible. The second method was implemented in my patch. Are there any drawbacks with this?
Then I respectfully suggest that you're releasing locks too early. Your proposal would allow a 2nd user to see the results of the 1st user's transaction before the 1st user knew about whether it had committed or not. I know why you want that, but I don't think it's right. This has very little, if anything, to do with mixing async/sync connections. You make it sound like all transactions always wait for other transactions, which they definitely don't, especially in reasonably well designed applications. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Sep 6, 2010 at 10:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Then I respectfully suggest that you're releasing locks too early. > > Your proposal would allow a 2nd user to see the results of the 1st > user's transaction before the 1st user knew about whether it had > committed or not. > > I know why you want that, but I don't think it's right. Well that's always possible. The 1st user might just not wake up before the 2nd user gets the response back. The question is what happens if the server crashes and is failed over to the slave. The 2nd user with the async transaction might have seen data committed by the 1st user with his sync transaction that was subsequently lost. Is the user expecting that making his transaction synchronously replicated guarantees that *nobody* can see this data unless the transaction is guaranteed to have been replicated, or is he only expecting it to guarantee that *he* can't see the commit until it can be trusted to be replicated? For that matter I'm not entirely clear I understand how the timing here works at all. If transactions can't be considered to be committed before they're acknowledged by the replica, what happens if the master crashes after the WAL is written and then comes back without a failover? Then the transaction would be immediately visible even if it still hasn't been replicated yet. I think there's no way with our current infrastructure to guarantee that other transactions can't see your data before it's been replicated, so making any promise otherwise for some cases is only going to be a lie. To guarantee synchronous replication doesn't show data until it's been replicated we would have to do some kind of 2-phase commit where we send the commit record to the slave and wait until the slave has received it and confirmed it has written it (but it doesn't replay it unless there's a failover), then write the master's commit record and send the message to the slave that it's safe to replay those records. -- greg
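For clarity, the ordering Greg sketches might look like this; every helper name here is invented, and nothing of the sort exists in the tree. The essential property is that the standby holds the commit record durably, but unreplayed, before the master commits.

    /* Invented stand-ins for the messaging this scheme would need. */
    static void SendCommitRecordToStandby(void) { }
    static void WaitForStandbyWriteAck(void) { }   /* record durable on the
                                                    * standby, NOT replayed */
    static void FlushLocalCommitRecord(void) { }   /* master's own commit */
    static void SendReplayPermission(void) { }     /* standby may now replay */

    static void
    TwoPhaseCommitRecordSketch(void)
    {
        SendCommitRecordToStandby();    /* phase 1: ship the commit record */
        WaitForStandbyWriteAck();
        FlushLocalCommitRecord();
        SendReplayPermission();         /* phase 2: unlock replay */
    }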
On Mon, 2010-09-06 at 23:07 +0100, Greg Stark wrote: > On Mon, Sep 6, 2010 at 10:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Then I respectfully suggest that you're releasing locks too early. > > > > Your proposal would allow a 2nd user to see the results of the 1st > > user's transaction before the 1st user knew about whether it had > > committed or not. > > > > I know why you want that, but I don't think it's right. > > Well that's always possible. The 1st user might just not wake up > before the 2nd user gets the response back. > > The question is what happens if the server crashes and is failed over > to the slave. The 2nd user with the async transaction might have seen > data committed by the 1st user with his sync transaction that was > subsequently lost. Is the user expecting that making his transaction > synchronously replicated guarantees that *nobody* can see this data > unless the transaction is guaranteed to have been replicated, or is he > only expecting it to guarantee that *he* can't see the commit until it > can be trusted to be replicated? > > For that matter I'm not entirely clear I understand how the timing > here works at all. If transactions can't be considered to be committed > before they're acknowledged by the replica, what happens if the master > crashes after the WAL is written and then comes back without a > failover? Then the transaction would be immediately visible even if it > still hasn't been replicated yet. > > I think there's no way with our current infrastructure to guarantee > that other transactions can't see your data before it's been > replicated, so making any promise otherwise for some cases is only > going to be a lie. > > To guarantee synchronous replication doesn't show data until it's been > replicated we would have to do some kind of 2-phase commit where we send > the commit record to the slave and wait until the slave has received > it and confirmed it has written it (but it doesn't replay it unless > there's a failover), then write the master's commit record and send the > message to the slave that it's safe to replay those records. Just to add that this part of the discussion has nothing at all to do with my proposal for master controlled replication. Zoltan is simply discussing when the wait should occur with sync replication. I have no proposal to vary that myself, wherever we eventually decide the wait should occur. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Mon, 2010-09-06 at 16:11 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Mon, 2010-09-06 at 21:45 +0200, Boszormenyi Zoltan wrote: > >> The locks held by a transaction are released after > >> RecordTransactionCommit(), and waiting for the sync ack > >> happens in this function. Now what happens when a sync > >> transaction holds a lock that an async one is waiting for? > > > It seems your glass is half-empty. Mine is half-full. > > Simon, you really are failing to advance the conversation. You claim > that we can have sync plus async transactions without a performance hit, > but you have failed to explain how, at least in any way that anyone > else understands. Pontificating about how that will be so much better > than not having it doesn't address the problem that others are having > with seeing how to implement it. A performance hit from mixing sync and async is unlikely. The overhead of deciding whether to wait after commit is trivial. At worst, the async transactions would go at the same speed as the sync transactions, especially if the application contends with itself, which is by no means a certainty. If acting independently, the async transactions would clearly go much faster. So the right question for discussion is "how much will we gain by mixing async/sync?". Since we had exactly this situation for synchronous_commit and a similar discussion, I expect a similar eventual outcome. The discussion would go better if we had clear performance results published from existing work and we did not dissuade people from objective testing. Then you'd probably understand why I think this is so important to me. I've explained more than once how my proposal can work and Dimitri at least appears to have understood with zero off-list conversation. So far the discussion has been mostly negative and the reasons given haven't scored high on logic, I'm sorry to say. I will present a code-based proposal rather than just huge piles of words, to make this a more concrete discussion. -- Simon Riggs www.2ndQuadrant.com
On Mon, Sep 6, 2010 at 5:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Then I respectfully suggest that you're releasing locks too early. > > Your proposal would allow a 2nd user to see the results of the 1st > user's transaction before the 1st user knew about whether it had > committed or not. Marking the transaction committed in CLOG will have that effect anyway. We are not doing strict two-phase locking. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, On 05/27/2010 01:28 PM, Robert Haas wrote: > How do you propose to guarantee that? ISTM that you have to either > commit locally first, or send the commit to the remote first. Either > way, the two events won't occur exactly simultaneously. I'm not getting the point of this discussion. As long as the database confirms the commit to the client only *after* having an ack from the standby and *after* committing locally, there's no problem. In any case, a server failure in between the commit request of the client and the commit confirmation for the client results in a client that can't tell if its transaction committed or not. So why do we care what to do first internally? Ideally, these two tasks should happen concurrently, IMO. Regards Markus Wanner
On Tue, Sep 7, 2010 at 4:01 AM, Markus Wanner <markus@bluegap.ch> wrote: > In any case, a server failure in between the commit request of the client > and the commit confirmation for the client results in a client that can't > tell if its transaction committed or not. > > So why do we care what to do first internally? Ideally, these two tasks > should happen concurrently, IMO. Right, definitely. The trouble is that if they happen concurrently, and there's a crash, you have to be prepared for the possibility that either one of the two has completed and the other is not. In practice, this means that the master and standby need to compare notes on the ending WAL location and whichever one is further advanced needs to stream the intervening records to the other. This would be an awesome feature, but it's hard, so for a first version, it makes sense to commit on the master first and then on the standby after the master is known done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/07/2010 02:16 PM, Robert Haas wrote: > Right, definitely. The trouble is that if they happen concurrently, > and there's a crash, you have to be prepared for the possibility that > either one of the two has completed and the other is not. Understood. > In > practice, this means that the master and standby need to compare notes > on the ending WAL location and whichever one is further advanced needs > to stream the intervening records to the other. Not necessarily, no. Remember that the client didn't get a commit confirmation. So reverting might also be a correct solution (i.e. not violating the durability constraint). > This would be an > awesome feature, but it's hard, so for a first version, it makes sense > to commit on the master first and then on the standby after the master > is known done. The obvious downside of that is that latency adds up, instead of just being the max of the two operations. And that's for normal operation, while at best it saves an unconfirmed transaction in the failure case. It might be harder to implement, yes. Regards Markus Wanner
On Tue, Sep 7, 2010 at 9:45 AM, Markus Wanner <markus@bluegap.ch> wrote: > On 09/07/2010 02:16 PM, Robert Haas wrote: >> >> Right, definitely. The trouble is that if they happen concurrently, >> and there's a crash, you have to be prepared for the possibility that >> either one of the two has completed and the other is not. > > Understood. > >> In >> practice, this means that the master and standby need to compare notes >> on the ending WAL location and whichever one is further advanced needs >> to stream the intervening records to the other. > > Not necessarily, no. Remember that the client didn't get a commit > confirmation. So reverting might also be a correct solution (i.e. not > violating the durability constraint). In theory, that's true, but if we do that, then there's an even bigger problem: the slave might have replayed WAL ahead of the master location; therefore the slave is now corrupt and a new base backup must be taken. >> This would be an >> awesome feature, but it's hard, so for a first version, it makes sense >> to commit on the master first and then on the standby after the master >> is known done. > > The obvious downside of that is that latency adds up, instead of just being > the max of the two operations. And that's for normal operation, while at best > it saves an unconfirmed transaction in the failure case. > > It might be harder to implement, yes. Yeah, I hope we'll get there eventually. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/07/2010 04:15 PM, Robert Haas wrote: > In theory, that's true, but if we do that, then there's an even bigger > problem: the slave might have replayed WAL ahead of the master > location; therefore the slave is now corrupt and a new base backup > must be taken. The slave isn't corrupt. It would suffice to "late abort" committed transactions the master doesn't know about. However, I realize that undoing of WAL isn't something that's implemented (nor planned). So it's probably easier to forward the master in such a case. > Yeah, I hope we'll get there eventually. Understood. Thanks. Markus Wanner
Markus Wanner wrote: > On 09/07/2010 02:16 PM, Robert Haas wrote: >> practice, this means that the master and standby need to compare notes >> on the ending WAL location and whichever one is further advanced needs >> to stream the intervening records to the other. > > Not necessarily, no. Remember that the client didn't get a commit > confirmation. So reverting might also be a correct solution (i.e. not > violating the durability constraint). In that situation, wouldn't it be possible that a different client queried the slave and already saw the result of that transaction which would later be rolled back?
On Tue, 2010-09-07 at 16:31 +0200, Markus Wanner wrote: > On 09/07/2010 04:15 PM, Robert Haas wrote: > > In theory, that's true, but if we do that, then there's an even bigger > > problem: the slave might have replayed WAL ahead of the master > > location; therefore the slave is now corrupt and a new base backup > > must be taken. > > The slave isn't corrupt. It would suffice to "late abort" committed > transactions the master doesn't know about. The slave *might* be ahead of the master. And if it is, the case we're discussing is where the master just crashed and *might* not even be coming back at all, at least for a while. The standby does differ from master, but with the master down I don't regard that as a useful statement. If we wait for fsync on master and then transfer to standby the times are additive. If we do them concurrently the response times will be the maximum response time of fsync/transfer, as Markus observes. ISTM that most people would be more interested in reducing response times by ~50% rather than in being exactly correct in an edge case. So we should be planning that as a robustness option, not "it cannot be done", which seems to be echoing around too much for my liking. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Markus Wanner <markus@bluegap.ch> writes: > On 09/07/2010 04:15 PM, Robert Haas wrote: >> In theory, that's true, but if we do that, then there's an even bigger >> problem: the slave might have replayed WAL ahead of the master >> location; therefore the slave is now corrupt and a new base backup >> must be taken. > The slave isn't corrupt. It would suffice to "late abort" committed > transactions the master doesn't know about. Oh yes it is. If the slave replays WAL that didn't happen on the master, it might for instance have heap tuples in TID slots that are empty on the master, or index pages laid out differently from the master. Trying to apply additional WAL from the master will fail badly. We can *not* allow the slave to replay WAL ahead of what is known committed to disk on the master. The only way to make that safe is the compare-notes-and-ship-WAL-back approach that Robert mentioned. If you feel that decoupling WAL application is absolutely essential to have a credible feature, then you'd better bite the bullet and start working on the ship-WAL-back code. regards, tom lane
On Tue, 2010-09-07 at 11:17 -0400, Tom Lane wrote: > Markus Wanner <markus@bluegap.ch> writes: > > On 09/07/2010 04:15 PM, Robert Haas wrote: > >> In theory, that's true, but if we do that, then there's an even bigger > >> problem: the slave might have replayed WAL ahead of the master > >> location; therefore the slave is now corrupt and a new base backup > >> must be taken. > > > The slave isn't corrupt. It would suffice to "late abort" committed > > transactions the master doesn't know about. > > Oh yes it is. If the slave replays WAL that didn't happen on the > master, it might for instance have heap tuples in TID slots that are > empty on the master, or index pages laid out differently from the > master. Trying to apply additional WAL from the master will fail badly. > > We can *not* allow the slave to replay WAL ahead of what is known > committed to disk on the master. The only way to make that safe > is the compare-notes-and-ship-WAL-back approach that Robert mentioned. > > If you feel that decoupling WAL application is absolutely essential > to have a credible feature, then you'd better bite the bullet and > start working on the ship-WAL-back code. Why not just failover? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 09/07/2010 04:47 PM, Ron Mayer wrote: > In that situation, wouldn't it be possible that a different client > queried the slave and already saw the result of that transaction > which would later be rolled back? Good point, yes. Regards Markus Wanner
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2010-09-07 at 11:17 -0400, Tom Lane wrote: >> We can *not* allow the slave to replay WAL ahead of what is known >> committed to disk on the master. The only way to make that safe >> is the compare-notes-and-ship-WAL-back approach that Robert mentioned. >> >> If you feel that decoupling WAL application is absolutely essential >> to have a credible feature, then you'd better bite the bullet and >> start working on the ship-WAL-back code. > Why not just failover? Guaranteed failover is another large piece we don't have. regards, tom lane
Hi, On 09/07/2010 05:17 PM, Tom Lane wrote: > Oh yes it is. If the slave replays WAL that didn't happen on the > master, it might for instance have heap tuples in TID slots that are > empty on the master, or index pages laid out differently from the > master. Trying to apply additional WAL from the master will fail badly. Sure. Reverting to the master's state would be required to be able to safely proceed. Granted, that's far from simple. Robert's argument about read queries on the standby convinced me that you always need to recover to the node with the newest transactions applied (i.e. better advance rather than revert). Making sure the standby can't ever be ahead of the master node certainly is the simplest way to guarantee that. At its cost for normal operation, though. How about a master failure which leads to a fail-over, immediately followed by a failure of that former standby (and now a master)? The old master might then be in the very same situation: having WAL applied that the new master doesn't have. Do we require former masters to fetch a base backup? How does it know the difference, once it gets back up? > We can *not* allow the slave to replay WAL ahead of what is known > committed to disk on the master. The only way to make that safe > is the compare-notes-and-ship-WAL-back approach that Robert mentioned. Agreed. (And it's worth pointing out that this approach has a pretty nasty requirement for a full-cluster crash: all nodes that were synchronously replicated to need to come back up after such a crash, so as to be able to reliably determine which has the newest transaction). > If you feel that decoupling WAL application is absolutely essential > to have a credible feature, then you'd better bite the bullet and > start working on the ship-WAL-back code. My feeling is that WAL is the wrong format to do replication. But that's another story. Regards Markus Wanner
On Tue, Sep 7, 2010 at 11:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, 2010-09-07 at 16:31 +0200, Markus Wanner wrote: >> On 09/07/2010 04:15 PM, Robert Haas wrote: >> > In theory, that's true, but if we do that, then there's an even bigger >> > problem: the slave might have replayed WAL ahead of the master >> > location; therefore the slave is now corrupt and a new base backup >> > must be taken. >> >> The slave isn't corrupt. It would suffice to "late abort" committed >> transactions the master doesn't know about. > > The slave *might* be ahead of the master. And if it is, the case we're > discussing is where the master just crashed and *might* not even be > coming back at all, at least for a while. The standby does differ from > master, but with the master down I don't regard that as a useful > statement. > > If we wait for fsync on master and then transfer to standby the times > are additive. If we do them concurrently the response times will be the > maximum response time of fsync/transfer, as Markus observes. > > ISTM that most people would be more interested in reducing response > times by ~50% rather than in being exactly correct in an edge case. So > we should be planning that as a robustness option, not "it cannot be > done", which seems to be echoing around too much for my liking. People who are more concerned about performance than robustness aren't going to use sync rep in the first place. They're going to run it in async, which will improve performance by FAR more than you'll ever be able to manage by deciding that you don't care about handling some of the failure cases correctly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, On 09/07/2010 06:00 PM, Robert Haas wrote: > People who are more concerned about performance than robustness aren't > going to use sync rep in the first place. I've been advocating sync (or eager, FWIW) replication for years now. One of the hardest preconceptions I'm always confronted with is: this must perform poorly! Whether or not that's true depends, but my point is: people who need that level of robustness certainly care about performance as well. Telling them to use async replication instead is not an option. (The ability to mix sync and async replication per transaction is one, BTW). > They're going to run it in > async, which will improve performance by FAR more than you'll ever be > able to manage by deciding that you don't care about handling some of > the failure cases correctly. Running in async and then trying to achieve the required level of robustness in the application layer pretty certainly performs worse than a good sync replication implementation. Async only wins if you really don't care about the loss of transactions in the case of a failure. In every other case, robustness is better taken care of by the database system itself, IMO. That being said, I certainly agree to do things step by step. And the ability to write to WAL and wait for ack from a standby concurrently can (and probably should) be considered an optimization, yes. Regards Markus Wanner
On 09/07/2010 05:55 PM, Markus Wanner wrote: > Robert's argument Sorry, I meant Ron. Regards Markus Wanner
On Tue, Sep 7, 2010 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > We can *not* allow the slave to replay WAL ahead of what is known > committed to disk on the master. The only way to make that safe > is the compare-notes-and-ship-WAL-back approach that Robert mentioned. > > If you feel that decoupling WAL application is absolutely essential > to have a credible feature, then you'd better bite the bullet and > start working on the ship-WAL-back code. > In the mode where it is not required that the WAL is applied (only sent to the slave / synced to slave disk), one alternative is to have a separate pointer to the last WAL record that can be safely applied on the slave. Then you can send the un-synced WAL to the slave (while concurrently syncing it on the master). When both the slave and the master sync complete, one can give the client a commit notification, increase the pointer, and send it to the slave (it would be a separate WAL record type I guess). In case of master failure, the slave can discard the un-applied WAL after the pointer. Greetings marcin
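A small C sketch of marcin's idea, with invented names: the standby keeps the receive position and the safe-to-apply position separate, and replay never passes the latter, so WAL beyond it can simply be discarded after a master failure.

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    static XLogRecPtr received_upto;        /* WAL streamed from the master */
    static XLogRecPtr safe_to_apply_upto;   /* advanced by the special WAL
                                             * record sent once the master's
                                             * own fsync has completed */

    static XLogRecPtr
    ReplayLimitSketch(void)
    {
        /* Replay never passes the safe-to-apply pointer; WAL beyond it
         * is simply discarded if the master fails. */
        return (received_upto < safe_to_apply_upto)
            ? received_upto
            : safe_to_apply_upto;
    }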
On Tue, Sep 7, 2010 at 4:06 PM, marcin mank <marcin.mank@gmail.com> wrote: > On Tue, Sep 7, 2010 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> We can *not* allow the slave to replay WAL ahead of what is known >> committed to disk on the master. The only way to make that safe >> is the compare-notes-and-ship-WAL-back approach that Robert mentioned. >> >> If you feel that decoupling WAL application is absolutely essential >> to have a credible feature, then you'd better bite the bullet and >> start working on the ship-WAL-back code. >> > > In the mode where it is not required that the WAL is applied (only > sent to the slave / synced to slave disk), one alternative is to have a > separate pointer to the last WAL record that can be safely applied on > the slave. Then you can send the un-synced WAL to the slave (while > concurrently syncing it on the master). When both the slave and the > master sync complete, one can give the client a commit notification, > increase the pointer, and send it to the slave (it would be a separate > WAL record type I guess). > > In case of master failure, the slave can discard the un-applied WAL > after the pointer. But the pointer on the slave has to be fsync'd to make it persistent, which likely takes roughly the same amount of time as fsync-ing the WAL itself. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, Sep 7, 2010 at 6:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Mon, 2010-09-06 at 22:32 +0200, Boszormenyi Zoltan wrote:
>> (in commit)
>> write wal record
>> release locks/etc      <- xact2 can proceed from here
>> wait for sync ack
>>
>> In the first case, the contention is obviously increased. With this, we are creating more idle time in the server instead of letting other transactions do their jobs as soon as possible. The second method was implemented in my patch. Are there any drawbacks with this?
> > Then I respectfully suggest that you're releasing locks too early. > > Your proposal would allow a 2nd user to see the results of the 1st > user's transaction before the 1st user knew about whether it had > committed or not. > > I know why you want that, but I don't think it's right. Agreed. That's why I put the wait before ProcArrayEndTransaction() is called. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Tue, Sep 7, 2010 at 6:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> On Mon, 2010-09-06 at 22:32 +0200, Boszormenyi Zoltan wrote: >>
>>> (in commit)
>>> write wal record
>>> release locks/etc      <- xact2 can proceed from here
>>> wait for sync ack
>>>
>>> In the first case, the contention is obviously increased. With this, we are creating more idle time in the server instead of letting other transactions do their jobs as soon as possible. The second method was implemented in my patch. Are there any drawbacks with this?
>> Then I respectfully suggest that you're releasing locks too early. >> >> Your proposal would allow a 2nd user to see the results of the 1st >> user's transaction before the 1st user knew about whether it had >> committed or not. >> >> I know why you want that, but I don't think it's right. >> > > Agreed. That's why I put the wait before ProcArrayEndTransaction() > is called. > Then there is no use to implement individual sync/async replicated transactions, period. An async replicated transaction that waits for a sync replicated transaction because of locks will become implicitly sync. It just waits for another transaction's sync ack. Best regards, Zoltán Böszörményi > Regards, > > -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
On Wed, Sep 8, 2010 at 7:04 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote: > Then there is no use to implement individual sync/async > replicated transactions, period. An async replicated transaction > that waits for a sync replicated transaction because of locks > will become implicitly sync. It just waits for another transaction's > sync ack. Hmm.. it's the same with an async transaction (i.e., synchronous_commit = false) and a sync one (synchronous_commit = true). An async transaction cannot take a lock held by a sync one until the sync one has flushed the WAL. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Wed, Sep 8, 2010 at 7:04 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote: > >> Then there is no use to implement individual sync/async >> replicated transactions, period. An async replicated transaction >> that waits for a sync replicated transaction because of locks >> will become implicitly sync. It just waits for another transaction's >> sync ack. >> > > Hmm.. it's the same with an async transaction (i.e., synchronous_commit = false) > and a sync one (synchronous_commit = true). An async transaction cannot take a > lock held by a sync one until the sync one has flushed the WAL. > You are right. -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
On Wed, Sep 8, 2010 at 6:52 AM, Boszormenyi Zoltan <zb@cybertec.at> wrote: > Fujii Masao wrote: >> On Wed, Sep 8, 2010 at 7:04 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote: >> >>> Then there is no use to implement individual sync/async >>> replicated transactions, period. An async replicated transaction >>> that waits for a sync replicated transaction because of locks >>> will become implicitly sync. It just waits for another transaction's >>> sync ack. >>> >> >> Hmm.. it's the same with an async transaction (i.e., synchronous_commit = false) >> and a sync one (synchronous_commit = true). An async transaction cannot take a >> lock held by a sync one until the sync one has flushed the WAL. >> > > You are right. I still don't see why it matters whether you wait before or after releasing locks. As soon as the transaction is marked committed in CLOG, other transactions can potentially see its effects. Holding on to all the locks might mitigate that somewhat, but it's not going to eliminate the problem. And in any event, there is ALWAYS a window of time during which the client doesn't know the transaction has committed but other transactions can potentially see its effects. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, Sep 8, 2010 at 8:43 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I still don't see why it matters whether you wait before or after > releasing locks. As soon as the transaction is marked committed in > CLOG, other transactions can potentially see its effects. AFAIR, even if CLOG has been updated, other transactions probably cannot see its effects until the transaction is marked as no longer running in PGPROC. But, if it's not true, I'd make the transaction wait for replication before the CLOG update. > And in any event, there is ALWAYS a window of > time during which the client doesn't know the transaction has > committed but other transactions can potentially see its effects. Yep. The problem here is that synchronous replication is likely to make the window very big. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
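Fujii's recollection matches the long-standing rule in the visibility code that the "still in progress?" check must come before the CLOG lookup, precisely because a transaction can be committed in CLOG while still listed as running in the proc array. A sketch of that ordering (the function names are stand-ins, not the real routines):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Stand-ins for the proc-array and CLOG checks. */
    static bool StillInProcArray(TransactionId xid) { (void) xid; return true; }
    static bool CommittedInClog(TransactionId xid) { (void) xid; return true; }

    static bool
    XminVisibleSketch(TransactionId xmin)
    {
        /* Proc array first: while the committing backend waits for the
         * standby's ack before ProcArrayEndTransaction(), this branch
         * keeps its tuples invisible to everyone else. */
        if (StillInProcArray(xmin))
            return false;

        /* Only once the backend has left the proc array is the CLOG
         * status taken as final. */
        return CommittedInClog(xmin);
    }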
On 09/08/2010 12:04 PM, Boszormenyi Zoltan wrote: > Then there is no use to implement individual sync/async > replicated transactions, period. I disagree. Different transactions have different priorities for latency vs. failure-resistance. > An async replicated transaction > that waits for a sync replicated transaction because of locks > will become implicitly sync. Sure. But how often do your transactions wait for another one because of locks? What do we have MVCC for? Regards Markus Wanner
On Wed, 2010-09-08 at 12:04 +0200, Boszormenyi Zoltan wrote: > >> > >> I know why you want that, but I don't think it's right. > >> > > > > Agreed. That's why I put the wait before ProcArrayEndTransaction() > > is called. > > Then there is no use to implement individual sync/async > replicated transactions, period. An async replicated transaction > that waits for a sync replicated transaction because of locks > will become implicitly sync. It just waits for another transaction's > sync ack. You aren't making any sense. You have made a general observation and deduced something specific about replication from it. Most transactions are not blocked by locks, especially in well designed applications, so the argument is not relevant to replication. If *any* two transactions wait upon each other then t2 will always wait until t1 has completed. If t1 is slow then any tuning you do on t2 will likely be wasted. If you are concerned about performance you should first remove the dependency between t1 and t2. The above observation isn't sufficient to conclude that "tuning of t2 should not happen via the tuning feature Simon has suggested". It's not sufficient to conclude much, if anything. As it turns out, in the scenario you outline t2 *would* return faster because you had marked it as "async". But it would wait behind t1, as you say. So the performance gain will be clear and measurable. Even so, it would be best to tune the problem (lock contention), not moan that the tool you're choosing to use (tuning replication) is at fault for being inappropriate to the problem. Mixing sync and async transactions is useful and it's a simple matter to come up with real examples where it would benefit, as well as easily testable workloads using pgbench. For example, customer table updates (sync) alongside chat messages (async). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Sep 8, 2010 at 8:30 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> And in any event, there is ALWAYS a window of >> time during which the client doesn't know the transaction has >> committed but other transactions can potentially see its effects. > > Yep. The problem here is that synchronous replication is likely to > make the window very big. So what? If the correctness of your application depends on the *amount of time* this window lasts, it's already broken. It seems like you're arguing that we should artificially increase lock contention to guard against possible race conditions in user applications. That doesn't make any sense to me, so one of us is confused. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, Sep 8, 2010 at 10:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 8, 2010 at 8:30 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> And in any event, there is ALWAYS a window of >>> time during which the client doesn't know the transaction has >>> committed but other transactions can potentially see its effects. >> >> Yep. The problem here is that synchronous replication is likely to >> make the window very big. > > So what? If the correctness of your application depends on the > *amount of time* this window lasts, it's already broken. It seems > like you're arguing that we should artificially increase lock > contention to guard against possible race conditions in user > applications. That doesn't make any sense to me, so one of us is > confused. Yep ;) On second thought, the problem here is that the effects of the transaction marked as committed but still waiting for replication can disappear after failover. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Sep 8, 2010 at 9:32 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Sep 8, 2010 at 10:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Sep 8, 2010 at 8:30 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> And in any event, there is ALWAYS a window of >>>> time during which the client doesn't know the transaction has >>>> committed but other transactions can potentially see its effects. >>> >>> Yep. The problem here is that synchronous replication is likely to >>> make the window very big. >> >> So what? If the correctness of your application depends on the >> *amount of time* this window lasts, it's already broken. It seems >> like you're arguing that we should artificially increase lock >> contention to guard against possible race conditions in user >> applications. That doesn't make any sense to me, so one of us is >> confused. > > Yep ;) On second thought, the problem here is that the effects of > the transaction marked as committed but still waiting for replication > can disappear after failover. Ah! I think that's right. So the scenario we're trying to guard against is something like this. A customer makes a withdrawal of money from an ATM; their bank balance is debited. The transaction tries to commit. After the transaction becomes visible to other backends but before the WAL reaches the standby, another transaction begins and reads the customer's balance. Naturally, they get the new, lower balance. Crash, master dead. Failover. If another transaction begins and reads the customer's balance again, it's back to the old value. So we have a phantom transaction: it appeared as committed and then vanished again. So that means we have to make sure that none of the effects of a transaction can be seen until WAL is fsync'd on the master AND the slave has acked. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-09-08 at 09:50 -0400, Robert Haas wrote: > So that means we have to make sure that none of the effects of a > transaction can be seen until WAL is fsync'd on the master AND the > slave has acked. Yes, that's right. And I like your example; one for the docs. There is a slight complexity there: An application might connect to the standby and see the changes made by the transaction, even though the master has not yet been notified, but will be in a moment. I don't see that as an issue though, but worth mentioning cos it's just the "Byzantine Generals" problem. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Sep 08, 2010 at 03:22:46PM +0100, Simon Riggs wrote: > On Wed, 2010-09-08 at 09:50 -0400, Robert Haas wrote: > > So that means we have to make sure that none of the effects of a > > transaction can be seen until WAL is fsync'd on the master AND the > > slave has acked. > > Yes, that's right. And I like your example; one for the docs. > > There is a slight complexity there: An application might connect to > the standby and see the changes made by the transaction, even though > the master has not yet been notified, but will be in a moment. I > don't see that as an issue though, but worth mentioning cos it's just > the "Byzantine Generals" problem. For completeness, a reference to the aforementioned Byzantine Generals: http://en.wikipedia.org/wiki/Byzantine_fault_tolerance Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Sep 8, 2010 at 10:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-09-08 at 09:50 -0400, Robert Haas wrote: > >> So that means we have to make sure that none of the effects of a >> transaction can be seen until WAL is fsync'd on the master AND the >> slave has acked. > > Yes, that's right. And I like your example; one for the docs. > > There is a slight complexity there: An application might connect to the > standby and see the changes made by the transaction, even though the > master has not yet been notified, but will be in a moment. I don't see > that as an issue though, but worth mentioning cos it's just the > "Byzantine Generals" problem. I think that's OK too, because there's no way we can guarantee that the transaction becomes visible exactly simultaneously on both nodes. What we do need to guarantee is that it is known committed on both nodes before it becomes visible on either, so that even if there is a crash or failover it can't uncommit itself. So the order of events must be:

- fsync WAL on master
- send WAL to slave
- wait for ack from slave
- allow transaction's effects to become visible on master

If the slave is only guaranteeing *receipt* of the WAL rather than fsync or replay of the WAL, then there is still a possibility of a disappearing transaction if the master and standby fail simultaneously AND a failover then occurs. So don't pick that mode if a disappearing transaction will result in someone dying or your $20B company going bankrupt or ... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
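Condensed into a sketch (invented helper names again), the ordering in the list above is simply sequential; nothing becomes visible on the master until the WAL is durable locally and the standby has acknowledged it at the configured level.

    /* Invented stand-ins for the steps listed above. */
    static void FlushLocalWAL(void) { }
    static void SendWALToStandby(void) { }
    static void WaitForStandbyAck(void) { }     /* receipt, fsync or replay,
                                                 * depending on the chosen
                                                 * mode */
    static void MakeEffectsVisible(void) { }    /* ProcArrayEndTransaction() */

    static void
    SyncCommitOrderingSketch(void)
    {
        FlushLocalWAL();
        SendWALToStandby();
        WaitForStandbyAck();
        MakeEffectsVisible();
    }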