Thread: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 20:14 +0900, Fujii Masao wrote:
> On Wed, Sep 15, 2010 at 7:35 PM, Heikki Linnakangas <heikki@postgresql.org> wrote:
> > Log Message:
> > -----------
> > Use a latch to make startup process wake up and replay immediately when new WAL arrives via streaming replication. This reduces the latency, and also allows us to use a longer polling interval, which is good for energy efficiency.
> >
> > We still need to poll to check for the appearance of a trigger file, but the interval is now 5 seconds (instead of 100ms), like when waiting for a new WAL segment to appear in WAL archive.
>
> Good work!

No, not good work.

You both know very well that I'm working on this area also and these commits are not agreed... yet. They might not be contended but they are very likely to break my patch, again.

Please desist while we resolve which are the good ideas and which are not. We won't know that if you keep breaking other people's patches in a stream of commits that prevent anybody completing other options.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
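The wake-up mechanism described in the commit message follows the usual latch pattern: the startup process sleeps on a latch with a timeout instead of polling, and walreceiver sets the latch once new WAL has been flushed. A minimal sketch of that pattern, assuming the early latch API (WaitLatch taking a plain timeout argument); the latch variable and the helper functions are illustrative, not the committed code:

    #include <stdbool.h>
    #include "storage/latch.h"          /* PostgreSQL's latch facility */

    /* Hypothetical helpers, declared only so the sketch reads as real code. */
    extern bool new_wal_available(void);
    extern void replay_available_wal(void);
    extern bool trigger_file_exists(void);

    /* Illustrative latch; in a real implementation it would live in shared
     * memory so that walreceiver can set it from another process. */
    static Latch recoveryWakeupLatch;

    /* Startup process: replay loop, sketched. */
    static void
    wait_for_more_wal(void)
    {
        for (;;)
        {
            ResetLatch(&recoveryWakeupLatch);

            if (new_wal_available())
            {
                replay_available_wal();
                continue;
            }

            if (trigger_file_exists())      /* still polled, now every 5 seconds */
                break;

            /* Sleep up to 5 seconds, but wake immediately when the latch is set
             * (timeout shown in microseconds; an assumption about the early API). */
            WaitLatch(&recoveryWakeupLatch, 5000000L);
        }
    }

    /* Walreceiver: after flushing newly received WAL, wake the startup process. */
    static void
    wakeup_startup(void)
    {
        SetLatch(&recoveryWakeupLatch);
    }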
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
David Fetter
Date:
On Wed, Sep 15, 2010 at 03:35:30PM +0100, Simon Riggs wrote: > On Wed, 2010-09-15 at 20:14 +0900, Fujii Masao wrote: > > On Wed, Sep 15, 2010 at 7:35 PM, Heikki Linnakangas > > <heikki@postgresql.org> wrote: > > > Log Message: > > > ----------- > > > Use a latch to make startup process wake up and replay immediately when > > > new WAL arrives via streaming replication. This reduces the latency, and > > > also allows us to use a longer polling interval, which is good for energy > > > efficiency. > > > > > > We still need to poll to check for the appearance of a trigger file, but > > > the interval is now 5 seconds (instead of 100ms), like when waiting for > > > a new WAL segment to appear in WAL archive. > > > > Good work! > > No, not good work. > > You both know very well that I'm working on this area also and these > commits are not agreed... yet. They might not be contended but they are > very likely to break my patch, again. > > Please desist while we resolve which are the good ideas and which are > not. We won't know that if you keep breaking other people's patches in a > stream of commits that prevent anybody completing other options. Simon, No matter how many times you try, you are not going to get a license to stop all work on anything you might chance to think about. It is quite simply never going to happen, so you need to back off. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 07:59 -0700, David Fetter wrote: > On Wed, Sep 15, 2010 at 03:35:30PM +0100, Simon Riggs wrote: > > Please desist while we resolve which are the good ideas and which are > > not. We won't know that if you keep breaking other people's patches in a > > stream of commits that prevent anybody completing other options. > No matter how many times you try, you are not going to get a license > to stop all work on anything you might chance to think about. It is > quite simply never going to happen, so you need to back off. I agree that asking people to stop work is not OK. However, I haven't asked for development work to stop, only that commits into that area stop until proper debate has taken place. Those might be minor commits, but they might not. Had I made those commits, they would have been called premature by others also. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I agree that asking people to stop work is not OK. However, I haven't > asked for development work to stop, only that commits into that area > stop until proper debate has taken place. Those might be minor commits, > but they might not. Had I made those commits, they would have been > called premature by others also. I do not believe that Heikki has done anything inappropriate. We've spent weeks discussing the latch facility and its various applications. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 11:25 -0400, Tom Lane wrote:
> ... an unspecified patch with no firm delivery date.

I'm happy to post my current work, if it's considered helpful. The sole intent of that work is to help the community understand the benefits of the proposals I have made, so perhaps this patch does serve that purpose.

The attached patch compiles, but I wouldn't bother trying to run it yet. I'm still wading through the latch rewrite. It probably doesn't apply cleanly to head anymore either, hence discussion. I wouldn't normally waste people's time by posting a non-working patch, but the majority of the code is in about the right place of execution. There aren't any unclear aspects in the design, so it's worth looking at.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Attachment
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 12:45 -0400, Robert Haas wrote: > On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > I agree that asking people to stop work is not OK. However, I haven't > > asked for development work to stop, only that commits into that area > > stop until proper debate has taken place. Those might be minor commits, > > but they might not. Had I made those commits, they would have been > > called premature by others also. > > I do not believe that Heikki has done anything inappropriate. We've > spent weeks discussing the latch facility and its various > applications. Sounds reasonable, but my comments were about this commit, not the one that happened on Saturday. This patch was posted about 32 hours ago, and the commit need not have taken place yet. If I had posted such a patch and committed it knowing other work is happening in that area we both know that you would have objected. It's not actually a major issue, but at some point I have to ask for no more commits, so Fujii and I can finish our patches, compare and contrast, so the best ideas can get into Postgres. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 1:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, 2010-09-15 at 12:45 -0400, Robert Haas wrote:
> > On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > > I agree that asking people to stop work is not OK. However, I haven't asked for development work to stop, only that commits into that area stop until proper debate has taken place. Those might be minor commits, but they might not. Had I made those commits, they would have been called premature by others also.
> >
> > I do not believe that Heikki has done anything inappropriate. We've spent weeks discussing the latch facility and its various applications.
>
> Sounds reasonable, but my comments were about this commit, not the one that happened on Saturday. This patch was posted about 32 hours ago, and the commit need not have taken place yet. If I had posted such a patch and committed it knowing other work is happening in that area we both know that you would have objected.

I've often felt that we ought to have a bit more delay between when committers post patches and when they commit them. I was told 24 hours and I've seen cases where people haven't even waited that long. On the other hand, if we get too strict about it, it can easily get to the point where it just gets in the way of progress, and certainly some patches are far more controversial than others. So I don't know what the best thing to do is. Still, I have to admit that I feel fairly positive about the direction we're going with this particular patch. Clearing away these peripheral issues should make it easier for us to have a rational discussion about the core issues around how this is going to be configured and actually work at the protocol level.

> It's not actually a major issue, but at some point I have to ask for no more commits, so Fujii and I can finish our patches, compare and contrast, so the best ideas can get into Postgres.

I don't think anyone is prepared to agree to that. I think that everyone is prepared to accept a limited amount of further delay in pressing forward with the main part of sync rep, but I expect that no one will be willing to freeze out incremental improvements in the meantime, even if it does induce a certain amount of rebasing. It's also worth noting that Fujii Masao's patch has been around for months, and yours isn't finished yet. That's not to say that we don't want to consider your ideas, because we do: and you've had more than your share of good ones. At the same time, it would be unfair and unreasonable to expect work on a patch that is done, and has been done for some time, to wait on one that isn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Heikki Linnakangas
Date:
On 15/09/10 20:58, Robert Haas wrote: > On Wed, Sep 15, 2010 at 1:30 PM, Simon Riggs<simon@2ndquadrant.com> wrote: >> On Wed, 2010-09-15 at 12:45 -0400, Robert Haas wrote: >>> On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs<simon@2ndquadrant.com> wrote: >>>> I agree that asking people to stop work is not OK. However, I haven't >>>> asked for development work to stop, only that commits into that area >>>> stop until proper debate has taken place. Those might be minor commits, >>>> but they might not. Had I made those commits, they would have been >>>> called premature by others also. >>> >>> I do not believe that Heikki has done anything inappropriate. We've >>> spent weeks discussing the latch facility and its various >>> applications. >> >> Sounds reasonable, but my comments were about this commit, not the one >> that happened on Saturday. This patch was posted about 32 hours ago, and >> the commit need not have taken place yet. If I had posted such a patch >> and committed it knowing other work is happening in that area we both >> know that you would have objected. > > I've often felt that we ought to have a bit more delay between when > committers post patches and when they commit them. I was told 24 > hours and I've seen cases where people haven't even waited that long. > On the other hand, if we get to strict about it, it can easily get to > the point where it just gets in the way of progress, and certainly > some patches are far more controversial than others. So I don't know > what the best thing to do is. With anything non-trivial, I try to "sleep on it" before committing. More with complicated patches, but it's really up to your own comfort level with the patch, and whether you think anyone might have different opinions on it. I don't mind quick commits if it's something that has been discussed in the past and the committer thinks it's non-controversial. There's always the option of complaining afterwards. If it comes to that, though, it wasn't really ripe for committing yet. (That doesn't apply to gripes about typos or something like that, because that happens to me way too often ;-) ) > Still, I have to admit that I feel > fairly positive about the direction we're going with this particular > patch. Clearing away these peripheral issues should make it easier > for us to have a rational discussion about the core issues around how > this is going to be configured and actually work at the protocol > level. Yeah, I don't think anyone has any qualms about the substance of these patches. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 13:58 -0400, Robert Haas wrote:
> > It's not actually a major issue, but at some point I have to ask for no more commits, so Fujii and I can finish our patches, compare and contrast, so the best ideas can get into Postgres.
>
> I don't think anyone is prepared to agree to that. I think that everyone is prepared to accept a limited amount of further delay in pressing forward with the main part of sync rep, but I expect that no one will be willing to freeze out incremental improvements in the meantime, even if it does induce a certain amount of rebasing. It's also worth noting that Fujii Masao's patch has been around for months, and yours isn't finished yet. That's not to say that we don't want to consider your ideas, because we do: and you've had more than your share of good ones. At the same time, it would be unfair and unreasonable to expect work on a patch that is done, and has been done for some time, to wait on one that isn't.

I understand your viewpoint there. I'm sure we all agree sync rep is a very important feature that must get into the next release.

The only reason my patch exists is because debate around my ideas was ruled out on various grounds. One of those was that it would take so long to develop we shouldn't risk not getting sync rep in this release. I am amenable to such arguments (and I make the same one on MERGE, btw, where I am getting seriously worried) but the reality is that there is actually very little code here and we can definitely do this, whatever ideas we pick. I've shown this by providing an almost working version in about 4 days' work. Will finishing it help?

We definitely have the time, so the question is, what are the best ideas? We must discuss the ideas properly, not just plough forwards claiming time pressure when it isn't actually an issue at all. We *need* to put the tools down and talk in detail about the best way forwards.

Before, I had no patch. Now mine "isn't finished". At what point will my ideas be reviewed without instant dismissal? If we accept your seniority argument, then "never", because even if I finish it you'll say "Fujii was there first". If who mentioned it first were important, then I'd say I've been discussing this for literally years (late 2006) and have regularly explained the benefits of the master-side approach I've outlined on list every time this has come up (every few months). I have also explained the implementation details many times as well, and I'm happy to say that latches are pretty much exactly what I described earlier. (I called them LSN queues, similar to lwlocks, IIRC.)

But that's not the whole deal. If we simply wanted a patch that was "done" we would have gone with Zoltan's, wouldn't we, based on the seniority argument you use above? Zoltan's patch didn't perform well at all. Fujii's performs much better. However, my proposed approach offers even better performance, so whatever argument you use to include Fujii's also applies to mine, doesn't it? But that's silly and divisive; it's not about whose patch "wins", is it?

Do we have to benchmark multiple patches to prove which is best? If that's the criterion, I'll finish my patch and demonstrate that. But it doesn't make sense to start committing pieces of Fujii's patch, so that I can't ever keep up and as a result "Simon never finished his patch, but it sounded good".

Next steps should be: tools down, discuss what to do. Then go forwards. We have time, so let's discuss all of the ideas on the table, not just some of them.

For me this is not about the number or names of parameters; it's about master-side control of sync rep and having very good performance.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 3:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Will finishing it help? Yes, I expect that to help a lot. > Before, I had no patch. Now mine "isn't finished". At what point will my > ideas be reviewed without instant dismissal? If we accept your seniority > argument, then "never" because even if I finish it you'll say "Fujii was > there first". I said very clearly in my previous email that "I think that everyone is prepared to accept a limited amount of further delay in pressing forward with the main part of sync rep". In other words, I think everyone is willing to consider your ideas provided that they are submitted in a form which everyone can understand and think through sometime soon. I am not, nor do I think anyone is, saying that we don't wish to consider your ideas. I'm actually really pleased that you are only a day or two from having a working patch. It can be much easier to conceptualize a patch than to find the time to finish it (unfortunately, this problem has overtaken me rather badly in the last few weeks, which is why I have no new patches in this CommitFest) and if you can finish it up and get it out in front of everyone I expect that to be a good thing for this feature and our community. > Do we have to benchmark multiple patches to prove which is best? If > that's the criteria I'll finish my patch and demonstrate that. I was thinking about that earlier today. I think it's definitely possible that we'll need to do some benchmarking, although I expect that people will want to read the code first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Fujii Masao
Date:
On Thu, Sep 16, 2010 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> We definitely have the time, so the question is, what are the best ideas?

Before advancing the review of each patch, we must determine what should be committed in 9.1, and what's in this CF.

The "synchronization level per transaction" feature is included in Simon's patch, but not in mine. This is the most important difference, which would have wide-reaching impact on the implementation, e.g., the protocol between walsender and walreceiver. So first we should determine whether we'll commit the feature in 9.1. Then we need to determine how far we should implement in this CF. Thoughts?

Each patch provides a "synchronization level per standby" feature. In Simon's patch, that level is specified in the standby's recovery.conf. In mine, it's in the master's standbys.conf. I think that the former is simpler. But if we support the capability to register the standbys, the latter would be required. Which is best?

Simon's patch seems to include a simple quorum commit feature (correct me if I'm wrong). That is, when there are multiple synchronous standbys, the master waits until an ACK has arrived from at least one standby. OTOH, in my patch, the master waits until ACKs have arrived from all the synchronous standbys. Which should we choose? I think that we should commit my straightforward approach first, and enable quorum commit on top of that. Thoughts?

Simon proposes to invoke walwriter in the standby. This is not included in my patch, but looks like a good idea. ISTM that this is not an essential feature for synchronous replication, so how about detaching the walwriter part from the patch and reviewing it independently?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Fri, 2010-09-17 at 14:33 +0900, Fujii Masao wrote: > On Thu, Sep 16, 2010 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > We definitely have the time, so the question is, what are the best > > ideas? > > Before advancing the review of each patch, we must determine what > should be committed in 9.1, and what's in this CF. Thank you for starting the discussion. > "Synchronization level on per-transaction" feature is included in Simon's > patch, but not in mine. This is most important difference Agreed. It's also a very important option for users. > which would > have wide-reaching impact on the implementation, e.g., protocol between > walsender and walreceiver. So, at first we should determine whether we'll > commit the feature in 9.1. Then we need to determine how far we should > implement in this CF. Thought? Yes, sync rep specified per-transaction changes many things at a low level. Basically, we have a choice of two mostly incompatible implementations, plus some other options common to both. There is no danger that we won't commit in 9.1. We have time for discussion and thought. We also have time for performance testing and since many of my design proposals are performance related that seems essential to properly reviewing the patches. I don't think we can determine how far to implement without considering both approaches in detail. With regard to your points below, I don't think any of those points could be committed first. > Each patch provides "synchronization level on per-standby" feature. In > Simon's patch, that level is specified in the standbys's recovery.conf. > In mine, it's in the master's standbys.conf. I think that the former is simpler. > But if we support the capability to register the standbys, the latter would > be required. Which is the best? Either approach is OK for me. Providing both options is also possible. My approach was just less code and less change to existing mechanisms, so I did it that way. There are some small optimisations possible on standby if the standby knows what role it's being asked to play. It doesn't matter to me whether we let standby tell master or master tell standby and the code is about the same either way. > Simon's patch seems to include simple quorum commit feature (correct > me if I'm wrong). That is, when there are multiple synchronous standbys, > the master waits until ACK has arrived from at least one standby. OTOH, > in my patch, the master waits until ACK has arrived from all the synchronous > standbys. Which should we choose? I think that we should commit my > straightforward approach first, and enable the quorum commit on that. > Thought? Yes, my approach is simple. For those with Oracle knowledge, my approach (first-reply-releases-waiter) is equivalent to Oracle's Maximum Protection mode (= 'fsync' in my design). Providing even higher levels of protection would not be the most common case. Your approach of waiting for all replies is much slower and requires more complex code, since we need to track intermediate states. It also has additional complexities of behaviour, such as how long do we wait for second acknowledgement when we already have one, and what happens when a second ack is not received? More failure modes == less stable. ISTM that it would require more effort to do this also, since every ack needs to check all WAL sender data to see if it is the last ack. None of that seems straightforward. I don't agree we should commit your approach to that aspect. 
In my proposal, such additional features would be possible as a plugin. The majority of users would not need this facility, and the plugin leaves the way open for high-end users that need this.

> Simon proposes to invoke walwriter in the standby. This is not included in my patch, but looks like a good idea. ISTM that this is not an essential feature for synchronous replication, so how about detaching the walwriter part from the patch and reviewing it independently?

I regard it as an essential feature for implementing 'recv' mode of sync rep, which is the fastest mode. At present WALreceiver does all of these: receive, write and fsync. Of those, the fsync is the slowest and increases response time significantly.

Of course the 'recv' option doesn't need to be part of the first commit, but splitting commits doesn't seem likely to make this go quicker or easier in the early stages. In particular, splitting some features out could make it much harder to put them back in again later. That point is why my patch even exists.

I would like to express my regret that the main feature proposal from me necessitates low-level changes that cause our two patches to be in conflict. Nobody should take this as a sign that there is a personal or professional problem between Fujii-san and myself.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
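To make the contrast above concrete, here is a self-contained sketch of the two release policies being debated: first-reply-releases-waiter versus waiting for acknowledgments from every synchronous standby. The types, field names and function are illustrative only, not taken from either patch:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative stand-ins; not the data structures of either patch. */
    typedef uint64_t lsn_t;

    typedef struct
    {
        bool   is_synchronous;   /* does this standby offer a sync level? */
        lsn_t  acked_lsn;        /* highest LSN this standby has acknowledged */
    } StandbyState;

    /*
     * Decide whether a backend waiting for commit_lsn may be released.
     * wait_for_all = true models waiting for every synchronous standby
     * (the approach described for Fujii's patch above); false models
     * first-reply-releases-waiter (Simon's approach).
     */
    static bool
    commit_is_released(const StandbyState *standbys, int nstandbys,
                       lsn_t commit_lsn, bool wait_for_all)
    {
        int  n_sync = 0;
        int  n_acked = 0;

        for (int i = 0; i < nstandbys; i++)
        {
            if (!standbys[i].is_synchronous)
                continue;
            n_sync++;
            if (standbys[i].acked_lsn >= commit_lsn)
                n_acked++;
        }

        if (n_sync == 0)
            return true;    /* no synchronous standby connected; a policy question in itself */

        return wait_for_all ? (n_acked == n_sync) : (n_acked >= 1);
    }

The first-reply policy only has to check a single counter; the wait-for-all policy must track every synchronous standby's progress, which is where the extra failure modes discussed above come from.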
(changed subject again.)

On 17/09/10 10:06, Simon Riggs wrote:
> I don't think we can determine how far to implement without considering both approaches in detail. With regard to your points below, I don't think any of those points could be committed first.

Yeah, I think we need to decide on the desired feature set first, before we dig deeper into the patches. The design and implementation will fall out of that.

That said, there are a few small things that can be progressed regardless of the details of synchronous replication. There are the changes to trigger failover with a signal, and it seems that we'll need some libpq changes to allow acknowledgments to be sent back to the master regardless of the rest of the design. We can discuss those in separate threads in parallel.

So the big question is what the user interface looks like. How does one configure synchronous replication, and what options are available? Here's a list of features that have been discussed. We don't necessarily need all of them in the first phase, but let's avoid painting ourselves into a corner.

* Support multiple standbys with various synchronization levels.

* What happens if a synchronous standby isn't connected at the moment? Return immediately vs. wait forever.

* Per-transaction control. Some transactions are important, others are not.

* Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers can be seen as important special cases of this.

* async, recv, fsync and replay levels of synchronization.

So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.

For per-transaction control, ISTM it would be enough to have a simple user-settable GUC like synchronous_commit. Let's call it "synchronous_replication_commit" for now. For non-critical transactions, you can turn it off. That's very simple for developers to understand and use. I don't think we need more fine-grained control than that at transaction level; in all the use cases I can think of, you have a stream of important transactions, mixed with non-important ones like log messages that you want to finish fast in a best-effort fashion. I'm actually tempted to tie that to the existing synchronous_commit GUC, since the use case seems exactly the same.

OTOH, if we do want fine-grained per-transaction control, a simple boolean or even an enum GUC doesn't really cut it. For truly fine-grained control you want to be able to specify exceptions like "wait until this is replayed in the slave named 'reporting'" or "don't wait for acknowledgment from the slave named 'uk-server'". With standby registration, we can invent a syntax for specifying overriding rules in the transaction. Something like SET replication_exceptions = 'reporting=replay, uk-server=async'.

For the control between async/recv/fsync/replay, I like to think in terms of:
a) asynchronous vs. synchronous
b) if it's synchronous, how synchronous is it? recv, fsync or replay?

I think it makes most sense to set sync vs. async in the master, and the level of synchronicity in the slave. Although I have sympathy for the argument that it's simpler if you configure it all from the master side as well.

Putting all of that together, I think Fujii-san's standby.conf is pretty close. What it needs is the additional GUC for transaction-level control.
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2010-09-17 at 11:09 +0300, Heikki Linnakangas wrote:
> That said, there are a few small things that can be progressed regardless of the details of synchronous replication. There are the changes to trigger failover with a signal, and it seems that we'll need some libpq changes to allow acknowledgments to be sent back to the master regardless of the rest of the design. We can discuss those in separate threads in parallel.

Agree to both of those points.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> * Support multiple standbys with various synchronization levels.
>
> * What happens if a synchronous standby isn't connected at the moment? Return immediately vs. wait forever.
>
> * Per-transaction control. Some transactions are important, others are not.
>
> * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers can be seen as important special cases of this.
>
> * async, recv, fsync and replay levels of synchronization.
>
> So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.

Well, the 1st point can be handled in a distributed fashion, where the sync level is set up at the slave. Ditto for the second point: you can get the exact same behavior control attached to the quorum facility. What I think your description is missing is the implicit feature that you want to be able to set up the "ignore-or-wait" failure behavior per standby. I'm not sure we need that, or more precisely that we need to have that level of detail in the master's setup. Maybe what we need instead is a more detailed quorum facility, but as you're talking about something similar later in the mail, let's follow you.

> For per-transaction control, ISTM it would be enough to have a simple user-settable GUC like synchronous_commit. Let's call it "synchronous_replication_commit" for now. For non-critical transactions, you can turn it off. That's very simple for developers to understand and use. I don't think we need more fine-grained control than that at transaction level; in all the use cases I can think of, you have a stream of important transactions, mixed with non-important ones like log messages that you want to finish fast in a best-effort fashion. I'm actually tempted to tie that to the existing synchronous_commit GUC, since the use case seems exactly the same.

Well, that would be an oversimplification. In my applications I set the "sessions" transactions to synchronous_commit = off, but the business transactions to synchronous_commit = on. Now, among the latter, I have backoffice editing and money transactions. I'm not willing to be forced to endure the same performance penalty for both when I know the distributed durability needs aren't the same.

> OTOH, if we do want fine-grained per-transaction control, a simple boolean or even an enum GUC doesn't really cut it. For truly fine-grained control you want to be able to specify exceptions like "wait until this is replayed in the slave named 'reporting'" or "don't wait for acknowledgment from the slave named 'uk-server'". With standby registration, we can invent a syntax for specifying overriding rules in the transaction. Something like SET replication_exceptions = 'reporting=replay, uk-server=async'.

Then you want to be able to have more than one reporting server and need only one of them at the "replay" level, but you don't need to know which it is. Or, on the contrary, you have a failover server and you want to be sure this one is at the replay level whatever happens.

Then you want topology flexibility: you need to be able to replace a reporting server with another, ditto for the failover one.

Did I tell you my current thinking on how to tackle that yet? :) Using a distributed setup, where each slave has a weight (several votes per transaction) and a level offering, would allow that, I think.

Now, something similar to your idea that I can see a need for is being able to have a multi-part quorum target: when you currently say that you want 2 votes for sync, you would be able to say you want 2 votes for recv, 2 for fsync and 1 for replay. Remember that any slave is set up to offer only one level of synchronicity but can offer multiple votes.

What would this look like in the setup? Best would be to register the different service levels your application needs. Time to bikeshed a little?

  sync_rep_services = {critical: recv=2, fsync=2, replay=1;
                       important: fsync=3;
                       reporting: recv=2, apply=1}

Well, you get the idea. It could maybe get stored in a catalog somewhere with nice SQL commands etc. The goal is then to be able to handle a much simpler GUC in the application, sync_rep_service = important for example. The reserved label would be off, the default value.

> For the control between async/recv/fsync/replay, I like to think in terms of
> a) asynchronous vs synchronous
> b) if it's synchronous, how synchronous is it? recv, fsync or replay?

Same here.

> I think it makes most sense to set sync vs. async in the master, and the level of synchronicity in the slave.

Yeah, exactly. If you add a weight to each slave and then a quorum commit, you don't change the implementation complexity and you offer a lot of setup flexibility. If the slave sync-level and weight are SIGHUP, it even becomes rather easy to switch roles online, or to add new servers, or to organise a maintenance window — the quorum to reach is a per-transaction GUC on the master, too, right?

Regards,
--
dim
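A small, self-contained illustration of the weighted-vote idea sketched above, checking a service definition such as "critical: recv=2, fsync=2, replay=1" against the acknowledgments gathered so far. All names are hypothetical, and the choice to count an acknowledgment at a stronger level toward the weaker levels as well is an assumption, not something spelled out in the proposal:

    #include <stdbool.h>

    /* Synchronization levels, ordered from weakest to strongest. */
    typedef enum { SYNC_RECV, SYNC_FSYNC, SYNC_REPLAY, NUM_SYNC_LEVELS } SyncLevel;

    /* Illustrative stand-in for one connected standby's offer and current state. */
    typedef struct
    {
        SyncLevel offered_level;   /* the single level this standby offers */
        int       weight;          /* votes this standby contributes */
        bool      has_acked;       /* has it acknowledged the commit in question? */
    } StandbyVote;

    /*
     * Check whether the acknowledgments gathered so far satisfy a service
     * definition; required[level] holds the votes needed at each level.
     * An acknowledgment at a stronger level is assumed to count toward all
     * weaker levels too.
     */
    static bool
    quorum_satisfied(const StandbyVote *votes, int nvotes,
                     const int required[NUM_SYNC_LEVELS])
    {
        int gathered[NUM_SYNC_LEVELS] = {0};

        for (int i = 0; i < nvotes; i++)
        {
            if (!votes[i].has_acked)
                continue;
            for (int lvl = 0; lvl <= (int) votes[i].offered_level; lvl++)
                gathered[lvl] += votes[i].weight;
        }

        for (int lvl = 0; lvl < NUM_SYNC_LEVELS; lvl++)
        {
            if (gathered[lvl] < required[lvl])
                return false;
        }
        return true;
    }

With something like this, the per-transaction knob reduces to selecting which required[] profile (critical, important, reporting, ...) the commit must satisfy.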
On Fri, 2010-09-17 at 09:15 +0100, Simon Riggs wrote:
> On Fri, 2010-09-17 at 11:09 +0300, Heikki Linnakangas wrote:
> > That said, there are a few small things that can be progressed regardless of the details of synchronous replication. There are the changes to trigger failover with a signal, and it seems that we'll need some libpq changes to allow acknowledgments to be sent back to the master regardless of the rest of the design. We can discuss those in separate threads in parallel.
>
> Agree to both of those points.

But I don't agree that those things should be committed just yet.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On 17/09/10 12:10, Dimitri Fontaine wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > * Support multiple standbys with various synchronization levels.
> >
> > * What happens if a synchronous standby isn't connected at the moment? Return immediately vs. wait forever.
> >
> > * Per-transaction control. Some transactions are important, others are not.
> >
> > * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers can be seen as important special cases of this.
> >
> > * async, recv, fsync and replay levels of synchronization.
> >
> > So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.
>
> Well, the 1st point can be handled in a distributed fashion, where the sync level is set up at the slave.

If the synchronicity is configured in the standby, how does the master know that there's a synchronous slave out there that it should wait for, if that slave isn't connected at the moment?

> > OTOH, if we do want fine-grained per-transaction control, a simple boolean or even an enum GUC doesn't really cut it. For truly fine-grained control you want to be able to specify exceptions like "wait until this is replayed in the slave named 'reporting'" or "don't wait for acknowledgment from the slave named 'uk-server'". With standby registration, we can invent a syntax for specifying overriding rules in the transaction. Something like SET replication_exceptions = 'reporting=replay, uk-server=async'.
>
> Then you want to be able to have more than one reporting server and need only one of them at the "replay" level, but you don't need to know which it is. Or, on the contrary, you have a failover server and you want to be sure this one is at the replay level whatever happens.
>
> Then you want topology flexibility: you need to be able to replace a reporting server with another, ditto for the failover one.
>
> Did I tell you my current thinking on how to tackle that yet? :) Using a distributed setup, where each slave has a weight (several votes per transaction) and a level offering, would allow that, I think.

Yeah, the quorum stuff. That's all good, but doesn't change the way you would do per-transaction control. By specifying overrides on a per-transaction basis, you can have as fine-grained control as you possibly can. Anything you can specify in a configuration file can then also be specified per-transaction with overrides. The syntax just needs to be flexible enough.

If we buy into the concept of per-transaction exceptions, we can put that issue aside for the moment, and just consider how to configure things in a config file. Anything you can express in the config file can also be expressed per-transaction with the exceptions GUC.

> Now, something similar to your idea that I can see a need for is being able to have a multi-part quorum target: when you currently say that you want 2 votes for sync, you would be able to say you want 2 votes for recv, 2 for fsync and 1 for replay. Remember that any slave is set up to offer only one level of synchronicity but can offer multiple votes.
>
> What would this look like in the setup? Best would be to register the different service levels your application needs. Time to bikeshed a little?
>
>   sync_rep_services = {critical: recv=2, fsync=2, replay=1;
>                        important: fsync=3;
>                        reporting: recv=2, apply=1}
>
> Well, you get the idea. It could maybe get stored in a catalog somewhere with nice SQL commands etc. The goal is then to be able to handle a much simpler GUC in the application, sync_rep_service = important for example. The reserved label would be off, the default value.

So ignoring the quorum stuff for a moment, the general idea is that you have predefined sets of configurations (or exceptions to the general config) specified in a config file, and in the application you just choose among those with "sync_rep_service=XXX". Yeah, I like that; it allows you to isolate the details of the topology from the application.

> If you add a weight to each slave and then a quorum commit, you don't change the implementation complexity and you offer a lot of setup flexibility. If the slave sync-level and weight are SIGHUP, it even becomes rather easy to switch roles online, or to add new servers, or to organise a maintenance window — the quorum to reach is a per-transaction GUC on the master, too, right?

I haven't bought into the quorum idea yet, but yeah, if we have quorum support, then it would be configurable on a per-transaction basis too with the above mechanism.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, 2010-09-17 at 11:09 +0300, Heikki Linnakangas wrote:
> (changed subject again.)
>
> On 17/09/10 10:06, Simon Riggs wrote:
> > I don't think we can determine how far to implement without considering both approaches in detail. With regard to your points below, I don't think any of those points could be committed first.
>
> Yeah, I think we need to decide on the desired feature set first, before we dig deeper into the patches. The design and implementation will fall out of that.

Well, we've discussed these things many times and talking hasn't got us very far on its own. We need measurements and neutral assessments. The patches are simple and we have time.

This isn't just about UI; there are significant and important differences between the proposals in terms of the capability and control they offer.

I propose we develop both patches further and performance test them. Many of the features I have proposed are performance related and people need to be able to see what is important, and what is not. But not through mere discussion: we need numbers to show which things matter and which things don't. And those need to be derived objectively.

> * Support multiple standbys with various synchronization levels.
>
> * What happens if a synchronous standby isn't connected at the moment? Return immediately vs. wait forever.
>
> * Per-transaction control. Some transactions are important, others are not.
>
> * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers can be seen as important special cases of this.
>
> * async, recv, fsync and replay levels of synchronization.

That's a reasonable starting list of points; there may be others.

> So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.

My patch provides those two requirements without standby registration, so we very clearly don't "need" standby registration. The question is: do we want standby registration on the master, and if so, why?

> For per-transaction control, ISTM it would be enough to have a simple user-settable GUC like synchronous_commit. Let's call it "synchronous_replication_commit" for now.

If you wish to change the name of the GUC away from the one I have proposed, fine. Please note that aspect isn't important to me and I will happily concede all such points to the majority view.

> For non-critical transactions, you can turn it off. That's very simple for developers to understand and use. I don't think we need more fine-grained control than that at transaction level; in all the use cases I can think of, you have a stream of important transactions, mixed with non-important ones like log messages that you want to finish fast in a best-effort fashion.

Sounds like we're getting somewhere. See below.

> I'm actually tempted to tie that to the existing synchronous_commit GUC, since the use case seems exactly the same.

http://archives.postgresql.org/pgsql-hackers/2008-07/msg01001.php

Check the date!

I think that particular point is going to confuse us. It will draw much bike shedding and won't help us decide between patches. It's a nicety that can be left to a time after we have the core feature committed.

> OTOH, if we do want fine-grained per-transaction control, a simple boolean or even an enum GUC doesn't really cut it. For truly fine-grained control you want to be able to specify exceptions like "wait until this is replayed in the slave named 'reporting'" or "don't wait for acknowledgment from the slave named 'uk-server'". With standby registration, we can invent a syntax for specifying overriding rules in the transaction. Something like SET replication_exceptions = 'reporting=replay, uk-server=async'.
>
> For the control between async/recv/fsync/replay, I like to think in terms of
> a) asynchronous vs synchronous
> b) if it's synchronous, how synchronous is it? recv, fsync or replay?
>
> I think it makes most sense to set sync vs. async in the master, and the level of synchronicity in the slave. Although I have sympathy for the argument that it's simpler if you configure it all from the master side as well.

I have catered for such requests by suggesting a plugin that allows you to implement that complexity without overburdening the core code.

This strikes me as an "ad absurdum" argument. Since the above over-complexity would doubtless be seen as insane by Tom et al, it attempts to persuade that we don't need recv, fsync and apply either.

Fujii has long talked about 4 levels of service also. Why change? I had thought that part was pretty much agreed between all of us.

Without performance tests to demonstrate "why", these do sound hard to understand. But we should note that DRBD offers recv ("B") and fsync ("C") as separate options. And Oracle implements all 3 of recv, fsync and apply. Neither of them describes those options so simply and easily as the way we are proposing with a 4-valued enum (with async as the fourth option).

If we have only one option for sync_rep = 'on', which of recv | fsync | apply would it implement? You don't mention that. Which do you choose? For what reason do you make that restriction? The code doesn't get any simpler, in my patch at least; from my perspective it would be a restriction without benefit.

I no longer seek to persuade by words alone. The existence of my patch means that I think that only measurements and tests will show why I have been saying these things. We need performance tests. I'm not ready for them today, but will be very soon. I suspect you aren't either, since from earlier discussions you didn't appear to have much on overall throughput, only about response times for single transactions. I'm happy to be proved wrong there.

> Putting all of that together, I think Fujii-san's standby.conf is pretty close. What it needs is the additional GUC for transaction-level control.

The difference between the patches is not a simple matter of a GUC.

My proposal allows a single standby to provide efficient replies to multiple requested durability levels all at the same time, with efficient use of network resources. ISTM that because the other patch cannot provide that you'd like to persuade us that we don't need that, ever. You won't sell me on that point, cos I can see lots of uses for it.

Another use case for you:

* customer orders are important, but we want lots of them, so we use recv mode for those.

* pricing data hardly ever changes, but when it does we need it to be applied across the cluster so we don't get read mismatches, so those rare transactions use apply mode.

If you don't want multiple modes at once, you don't need to use that feature. But there is no reason to prevent people having the choice, when a design exists that can provide it.

(A separate and later point is that I would one day like to annotate specific tables and functions with different modes, so a sysadmin can point out which data is important at table level - which is what MySQL provides by allowing choice of storage engine for particular tables. Nobody cares about the specific engine; they care about the durability implications of those choices. This isn't part of the current proposal, just a later statement of direction.)

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > If the synchronicity is configured in the standby, how does the master know > that there's a synchronous slave out there that it should wait for, if that > slave isn't connected at the moment? That's what quorum is trying to solve. The master knows how many votes per sync level the transaction needs. If no slave is acknowledging any vote, that's all you need to know to ROLLBACK (after the timeout), right? — if setup says so, on the master. > Yeah, the quorum stuff. That's all good, but doesn't change the way you > would do per-transaction control. That's when I bought in on the feature. It's all dynamic and distributed, and it offers per-transaction control. Regards, -- Dimitri Fontaine PostgreSQL DBA, Architecte
On Fri, 2010-09-17 at 12:30 +0300, Heikki Linnakangas wrote:
> If the synchronicity is configured in the standby, how does the master know that there's a synchronous slave out there that it should wait for, if that slave isn't connected at the moment?

That isn't a question you need standby registration to answer.

In my proposal, the user requests a certain level of confirmation and will wait until timeout to see if it is received. The standby can crash and restart, come back and provide the answer, and it will still work. So it is the user request that informs the master that there would normally be a synchronous slave out there it should wait for.

So far, I have added the point that if a user requests a level of confirmation that is currently unavailable, then it will use the highest level of confirmation available now. That stops us from waiting for timeout for every transaction we run if standby goes down hard, which just freezes the application for long periods to no real benefit. It also prevents applications from requesting durability levels the cluster cannot satisfy, in the opinion of the sysadmin, since the sysadmin specifies the max level on each standby.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
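The fallback behaviour described above (use the highest level of confirmation currently available when the requested level cannot be satisfied) can be sketched as a simple clamp. This is an illustration only; the enum and function names are not taken from the actual patch:

    /* Synchronization levels, ordered from weakest to strongest; names illustrative. */
    typedef enum { SYNC_ASYNC, SYNC_RECV, SYNC_FSYNC, SYNC_APPLY } SyncLevel;

    /*
     * A transaction asks for "requested", each connected standby advertises
     * the maximum service level it offers (e.g. sync_replication_service),
     * and the commit waits at the highest level actually available right
     * now, never more.
     */
    static SyncLevel
    effective_sync_level(SyncLevel requested, const SyncLevel *offered, int nstandbys)
    {
        SyncLevel best_available = SYNC_ASYNC;

        for (int i = 0; i < nstandbys; i++)
        {
            if (offered[i] > best_available)
                best_available = offered[i];
        }

        return (requested <= best_available) ? requested : best_available;
    }

For example, a transaction requesting apply while the only connected standby offers fsync would wait at fsync level rather than hanging until timeout.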
On 17/09/10 12:49, Simon Riggs wrote:
> This isn't just about UI; there are significant and important differences between the proposals in terms of the capability and control they offer.

Sure. The point of focusing on the UI is that the UI demonstrates what capability and control a proposal offers.

> > So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.
>
> My patch provides those two requirements without standby registration, so we very clearly don't "need" standby registration.

It's still not clear to me how you would configure things like "wait for ack from reporting slave, but not other slaves" or "wait until replayed in the server on the west coast" in your proposal. Maybe it's possible, but doesn't seem very intuitive, requiring careful configuration in both the master and the slaves.

In your proposal, you also need to be careful not to connect e.g. a test slave with "synchronous_replication_service = apply" to the master, or it will possibly shadow a real production slave, acknowledging transactions that are not yet received by the real slave. It's certainly possible to screw up with standby registration too, but you have more direct control of the master behavior in the master, instead of distributing it across all slaves.

> The question is: do we want standby registration on the master, and if so, why?

Well, aside from how to configure synchronous replication, standby registration would help with retaining the right amount of WAL in the master. wal_keep_segments doesn't guarantee that enough is retained, and OTOH when all standbys are connected you retain much more than might be required.

Giving names to slaves also allows you to view their status in the master in a more intuitive format. Something like:

  postgres=# SELECT * FROM pg_slave_status ;
      name    | connected |  received  |   fsyncd   |  applied
  ------------+-----------+------------+------------+------------
   reporting  | t         | 0/26000020 | 0/26000020 | 0/25550020
   ha-standby | t         | 0/26000020 | 0/26000020 | 0/26000020
   testserver | f         |            | 0/15000020 |
  (3 rows)

> > For the control between async/recv/fsync/replay, I like to think in terms of
> > a) asynchronous vs synchronous
> > b) if it's synchronous, how synchronous is it? recv, fsync or replay?
> >
> > I think it makes most sense to set sync vs. async in the master, and the level of synchronicity in the slave. Although I have sympathy for the argument that it's simpler if you configure it all from the master side as well.
>
> I have catered for such requests by suggesting a plugin that allows you to implement that complexity without overburdening the core code.

Well, plugins are certainly one possibility, but then we need to design the plugin API. I've been thinking along the lines of a proxy, which can implement whatever logic you want to decide when to send the acknowledgment. With a proxy as well, if we push any features that people want to a proxy or plugin, we need to make sure that the proxy/plugin has all the necessary information available.

> This strikes me as an "ad absurdum" argument. Since the above over-complexity would doubtless be seen as insane by Tom et al, it attempts to persuade that we don't need recv, fsync and apply either.
>
> Fujii has long talked about 4 levels of service also. Why change? I had thought that part was pretty much agreed between all of us.

Now you lost me. I agree that we need 4 levels of service (at least ultimately, not necessarily in the first phase).

> Without performance tests to demonstrate "why", these do sound hard to understand. But we should note that DRBD offers recv ("B") and fsync ("C") as separate options. And Oracle implements all 3 of recv, fsync and apply. Neither of them describes those options so simply and easily as the way we are proposing with a 4-valued enum (with async as the fourth option).
>
> If we have only one option for sync_rep = 'on', which of recv | fsync | apply would it implement? You don't mention that. Which do you choose?

You would choose between recv, fsync and apply in the slave, with a GUC.

> I no longer seek to persuade by words alone. The existence of my patch means that I think that only measurements and tests will show why I have been saying these things. We need performance tests.

I don't expect any meaningful differences in terms of performance between any of the discussed options. The big question right now is what features we provide and how they're configured. Performance will depend primarily on the mode you use, and secondarily on the implementation of the mode. It would be completely premature to do performance testing yet IMHO.

> > Putting all of that together, I think Fujii-san's standby.conf is pretty close. What it needs is the additional GUC for transaction-level control.
>
> The difference between the patches is not a simple matter of a GUC.
>
> My proposal allows a single standby to provide efficient replies to multiple requested durability levels all at the same time, with efficient use of network resources. ISTM that because the other patch cannot provide that you'd like to persuade us that we don't need that, ever. You won't sell me on that point, cos I can see lots of uses for it.

Simon, how the replies are sent is an implementation detail I haven't given much thought yet. The reason we delved into that discussion earlier was that you seemed to contradict yourself with the claims that you don't need to send more than one reply per transaction, and that the standby doesn't need to know the synchronization level. Other than the curiosity about that contradiction, it doesn't seem like a very interesting detail to me right now. It's not a question that drives the rest of the design, but the other way round.

But FWIW, something like your proposal of sending 3 XLogRecPtrs in each reply seems like a good approach.

I'm not sure about using walwriter. I can see that it helps with getting the 'recv' and 'replay' acknowledgments out faster, but I still have the scars from starting bgwriter during recovery.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
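The "3 XLogRecPtrs in each reply" approach mentioned above amounts to a reply message carrying the standby's received, fsynced and applied WAL positions, so the master can release waiters at whichever level each transaction asked for. A hypothetical sketch; the struct, field names and the uint64 stand-in for XLogRecPtr are illustrative, not the wire format of either patch:

    #include <stdbool.h>
    #include <stdint.h>

    /* Stand-in for XLogRecPtr. */
    typedef uint64_t WalPosition;

    /* One reply from standby to master, carrying three WAL positions. */
    typedef struct StandbyReply
    {
        WalPosition received_upto;   /* written to disk on the standby */
        WalPosition fsynced_upto;    /* flushed (fsync'd) on the standby */
        WalPosition applied_upto;    /* replayed on the standby */
    } StandbyReply;

    typedef enum { WAIT_RECV, WAIT_FSYNC, WAIT_APPLY } WaitLevel;

    /* Can a waiter for commit_pos at the given level be released by this reply? */
    static bool
    reply_satisfies(const StandbyReply *reply, WalPosition commit_pos, WaitLevel level)
    {
        switch (level)
        {
            case WAIT_RECV:  return reply->received_upto >= commit_pos;
            case WAIT_FSYNC: return reply->fsynced_upto  >= commit_pos;
            case WAIT_APPLY: return reply->applied_upto  >= commit_pos;
        }
        return false;
    }

Because one reply carries all three positions, a single standby can serve transactions waiting at different levels without sending extra messages, which is the network-efficiency point Simon makes above.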
Simon Riggs <simon@2ndQuadrant.com> writes: > So far, I have added the point that if a user requests a level of > confirmation that is currently unavailable, then it will use the highest > level of confirmation available now. That stops us from waiting for > timeout for every transaction we run if standby goes down hard, which > just freezes the application for long periods to no real benefit. It > also prevents applications from requesting durability levels the cluster > cannot satisfy, in the opinion of the sysadmin, since the sysadmin > specifies the max level on each standby. That sounds like the commit-or-rollback when slave are gone question. I think this behavior should be user-setable, again per-transaction. I agree with you that the general case looks like your proposed default, but we already know that some will need "don't ack if not replied before the timeout", and they even will go as far as asking for it to be reported as a serialisation error of some sort, I guess… Regards, -- Dimitri Fontaine PostgreSQL DBA, Architecte
On Fri, 2010-09-17 at 13:41 +0300, Heikki Linnakangas wrote: > On 17/09/10 12:49, Simon Riggs wrote: > > This isn't just about UI, there are significant and important > > differences between the proposals in terms of the capability and control > > they offer. > > Sure. The point of focusing on the UI is that the UI demonstrates what > capability and control a proposal offers. My patch does not include server registration. It could be added later on top of my patch without any issues. The core parts of my patch are the fine grained transaction-level control and the ability to mix them dynamically with good performance. To me server registration is not a core issue. I'm not actively against it, I just don't see the need for it at all. Certainly not committed first, especially since its not actually needed by either of our patches. Standby registration doesn't provide *any* parameter that can't be supplied from standby recovery.conf. The only thing standby registration allows you to do is know whether there was supposed to be a standby there, but yet it isn't there now. I don't see that point as being important because it seems strange to me to want to wait for a standby that ought to be there, but isn't anymore. What happens if it never comes back? Manual intervention required. (We agree on how to handle a standby that *is* "connected", yet never returns a reply or takes too long to do so). > >> So what should the user interface be like? Given the 1st and 2nd > >> requirement, we need standby registration. If some standbys are > >> important and others are not, the master needs to distinguish between > >> them to be able to determine that a transaction is safely delivered to > >> the important standbys. > > > > My patch provides those two requirements without standby registration, > > so we very clearly don't "need" standby registration. > > It's still not clear to me how you would configure things like "wait for > ack from reporting slave, but not other slaves" or "wait until replayed > in the server on the west coast" in your proposal. Maybe it's possible, > but doesn't seem very intuitive, requiring careful configuration in both > the master and the slaves. In the use cases we discussed we had simple 2 or 3 server configs. master standby1 - preferred sync target - set to recv, fsync or apply standby2 - non-preferred sync target, maybe test server - set to async So in the two cases you mention we might set "wait for ack from reporting slave" master: sync_replication = 'recv' #as default, can be changed reporting-slave: sync_replication_service = 'recv' #gives max level "wait until replayed in the server on the west coast" master: sync_replication = 'recv' #as default, can be changed west-coast: sync_replication_service = 'apply' #gives max level The absence of registration in my patch makes some things easier and some things harder. For example, you can add a new standby without editing the config on the master. If you had 2 standbys, both offering the same level of protection, my proposal would *not* allow you to specify that you preferred one master over another. But we could add a priority parameter as well if that's an issue. > In your proposal, you also need to be careful not to connect e.g a test > slave with "synchronous_replication_service = apply" to the master, or > it will possible shadow a real production slave, acknowledging > transactions that are not yet received by the real slave. 
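To make the two examples above easier to follow, here is a minimal sketch of how those settings might be spelled out in the configuration files, using the GUC names from Simon's proposal (sync_replication on the master, sync_replication_service on each standby). Which file the standby-side setting would live in is an assumption here, not something the message specifies:

    # master, postgresql.conf -- default request level, changeable per transaction
    sync_replication = 'recv'

    # reporting slave -- offers acknowledgments up to 'recv'
    sync_replication_service = 'recv'

    # west-coast slave -- offers acknowledgments up to 'apply'
    sync_replication_service = 'apply'

    # test slave -- never acts as a synchronous target
    sync_replication_service = 'async'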
It's certainly > possible to screw up with standby registration too, but you have more > direct control of the master behavior in the master, instead of > distributing it across all slaves. > > > The question is do we want standby registration on master and if so, > > why? > > Well, aside from how to configure synchronous replication, standby > registration would help with retaining the right amount of WAL in the > master. wal_keep_segments doesn't guarantee that enough is retained, and > OTOH when all standbys are connected you retain much more than might be > required. > > Giving names to slaves also allows you to view their status in the > master in a more intuitive format. Something like: We can give servers a name without registration. It actually makes more sense to set the name in the standby and it can be passed through from standby when we connect. I very much like the idea of server names and think this next SRF looks really cool. > postgres=# SELECT * FROM pg_slave_status ; > name | connected | received | fsyncd | applied > ------------+-----------+------------+------------+------------ > reporting | t | 0/26000020 | 0/26000020 | 0/25550020 > ha-standby | t | 0/26000020 | 0/26000020 | 0/26000020 > testserver | f | | 0/15000020 | > (3 rows) That could be added on top of my patch also. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Fri, Sep 17, 2010 at 6:41 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >>> So what should the user interface be like? Given the 1st and 2nd >>> requirement, we need standby registration. If some standbys are >>> important and others are not, the master needs to distinguish between >>> them to be able to determine that a transaction is safely delivered to >>> the important standbys. >> >> My patch provides those two requirements without standby registration, >> so we very clearly don't "need" standby registration. > > It's still not clear to me how you would configure things like "wait for ack > from reporting slave, but not other slaves" or "wait until replayed in the > server on the west coast" in your proposal. Maybe it's possible, but doesn't > seem very intuitive, requiring careful configuration in both the master and > the slaves. Agreed. I think this will be much simpler if all the configuration is in one place (on the master). > In your proposal, you also need to be careful not to connect e.g a test > slave with "synchronous_replication_service = apply" to the master, or it > will possible shadow a real production slave, acknowledging transactions > that are not yet received by the real slave. It's certainly possible to > screw up with standby registration too, but you have more direct control of > the master behavior in the master, instead of distributing it across all > slaves. Similarly agreed. >> The question is do we want standby registration on master and if so, >> why? > > Well, aside from how to configure synchronous replication, standby > registration would help with retaining the right amount of WAL in the > master. wal_keep_segments doesn't guarantee that enough is retained, and > OTOH when all standbys are connected you retain much more than might be > required. +1. > Giving names to slaves also allows you to view their status in the master in > a more intuitive format. Something like: > > postgres=# SELECT * FROM pg_slave_status ; > name | connected | received | fsyncd | applied > ------------+-----------+------------+------------+------------ > reporting | t | 0/26000020 | 0/26000020 | 0/25550020 > ha-standby | t | 0/26000020 | 0/26000020 | 0/26000020 > testserver | f | | 0/15000020 | > (3 rows) +1. Having said all of the above, I am not in favor your (Heikki's) proposal to configure sync/async on the slave and the level on the master. That seems like a somewhat bizarre division of labor, splitting what is essentially one setting across two machines. >>> For the control between async/recv/fsync/replay, I like to think in >>> terms of >>> a) asynchronous vs synchronous >>> b) if it's synchronous, how synchronous is it? recv, fsync or replay? >>> >>> I think it makes most sense to set sync vs. async in the master, and the >>> level of synchronicity in the slave. Although I have sympathy for the >>> argument that it's simpler if you configure it all from the master side >>> as well. >> >> I have catered for such requests by suggesting a plugin that allows you >> to implement that complexity without overburdening the core code. > > Well, plugins are certainly one possibility, but then we need to design the > plugin API. I've been thinking along the lines of a proxy, which can > implement whatever logic you want to decide when to send the acknowledgment. > With a proxy as well, if we push any features people that want to a proxy or > plugin, we need to make sure that the proxy/plugin has all the necessary > information available. 
I'm not really sold on the proxy idea. That seems like it adds a lot of configuration complexity, not to mention additional hops. Of course, the plug-in idea also won't be suitable for any but the most advanced users. I think of the two I prefer the idea of a plug-in, slightly, but maybe this doesn't have to be done in version 1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Sep 17, 2010 at 7:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > The only thing standby registration allows you to do is know whether > there was supposed to be a standby there, but yet it isn't there now. I > don't see that point as being important because it seems strange to me > to want to wait for a standby that ought to be there, but isn't anymore. > What happens if it never comes back? Manual intervention required. > > (We agree on how to handle a standby that *is* "connected", yet never > returns a reply or takes too long to do so). Doesn't Oracle provide a mode where it shuts down if this occurs? > The absence of registration in my patch makes some things easier and > some things harder. For example, you can add a new standby without > editing the config on the master. That's actually one of the reasons why I like the idea of registration. It seems rather scary to add a new standby without editing the config on the master. Actually, adding a new fully-async slave without touching the master seems reasonable, but adding a new sync slave without touching the master gives me the willies. The behavior of the system could change quite sharply when you do this, and it might not be obvious what has happened. (Imagine DBA #1 makes the change and DBA #2 is then trying to figure out what's happened - he checks the configs of all the machines he knows about and finds them all unchanged... head-scratching ensues.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Sep 17, 2010 at 7:41 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> The question is do we want standby registration on master and if so, >> why? > > Well, aside from how to configure synchronous replication, standby > registration would help with retaining the right amount of WAL in the > master. wal_keep_segments doesn't guarantee that enough is retained, and > OTOH when all standbys are connected you retain much more than might be > required. Yep. And standby registration is required when we support "wait forever when synchronous standby isn't connected at the moment" option that Heikki explained upthread. Though I don't think that standby registration is required in the first phase since "wait forever" option is not used in basic use case. Synchronous replication is basically used to reduce the downtime, and "wait forever" option opposes that. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Sep 17, 2010 at 8:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > The only thing standby registration allows you to do is know whether > there was supposed to be a standby there, but yet it isn't there now. I > don't see that point as being important because it seems strange to me > to want to wait for a standby that ought to be there, but isn't anymore. According to what I heard, some people want to guarantee that all the transactions are *always* written in *all* the synchronous standbys. IOW, they want to keep the transaction waiting until it has been written in all the synchronous standbys. Standby registration is required to support such a use case. Without the registration, the master cannot determine whether the transaction has been written in all the synchronous standbys. > What happens if it never comes back? Manual intervention required. Yep. > In the use cases we discussed we had simple 2 or 3 server configs. > > master > standby1 - preferred sync target - set to recv, fsync or apply > standby2 - non-preferred sync target, maybe test server - set to async > > So in the two cases you mention we might set > > "wait for ack from reporting slave" > master: sync_replication = 'recv' #as default, can be changed > reporting-slave: sync_replication_service = 'recv' #gives max level > > "wait until replayed in the server on the west coast" > master: sync_replication = 'recv' #as default, can be changed > west-coast: sync_replication_service = 'apply' #gives max level What synchronization level does each combination of sync_replication and sync_replication_service lead to? I'd like to see something like the following table.

 sync_replication | sync_replication_service | result
------------------+---------------------------+--------
 async            | async                     | ???
 async            | recv                      | ???
 async            | fsync                     | ???
 async            | apply                     | ???
 recv             | async                     | ???
 ...

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: > What synchronization level does each combination of sync_replication > and sync_replication_service lead to? I'd like to see something like > the following table. > > sync_replication | sync_replication_service | result > ------------------+--------------------------+-------- > async | async | ??? > async | recv | ??? > async | fsync | ??? > async | apply | ??? > recv | async | ??? > ... Good question. There are only 4 possible outcomes. There is no combination, so we don't need a table like that above. The "service" specifies the highest request type available from that specific standby. If someone requests a higher service than is currently offered by this standby, they will either a) get that service from another standby that does offer that level b) automatically downgrade the sync rep mode to the highest available. For example, if you request recv but there is only one standby and it only offers async, then you get downgraded to async. In all cases, if you request async then we act same as 9.0. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
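Taking Simon's explanation literally, the table Fujii asked for would collapse, for a single connected standby, to "the effective level is the lower of what the transaction requests and what that standby offers". That is a reading of the message above, not something either patch spells out in this form:

 sync_replication (master) | sync_replication_service (standby) | effective level
---------------------------+-------------------------------------+---------------------
 async                     | (any)                               | async
 recv                      | async                               | async (downgraded)
 recv                      | recv / fsync / apply                | recv
 fsync                     | recv                                | recv (downgraded)
 fsync                     | fsync / apply                       | fsync
 apply                     | fsync                               | fsync (downgraded)
 apply                     | apply                               | apply

With more than one standby, option a) applies first: the request is satisfied by whichever standby offers the requested level.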
On Fri, Sep 17, 2010 at 5:09 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers > can be seen as important special cases of this. I think that we should skip quorum commit at the first phase because the design seems to be still poorly-thought-out. I'm concerned about the case where the faster synchronous standby goes down and the lagged synchronous one remains when n=1. In this case, some transactions marked as committed in a client might not be replicated to the remaining synchronous standby yet. What if the master goes down at this point? How can we determine whether promoting the remaining standby to the master causes data loss? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Sep 17, 2010 at 8:43 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Sep 17, 2010 at 5:09 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers >> can be seen as important special cases of this. > > I think that we should skip quorum commit at the first phase > because the design seems to be still poorly-thought-out. > > I'm concerned about the case where the faster synchronous standby > goes down and the lagged synchronous one remains when n=1. In this > case, some transactions marked as committed in a client might not > be replicated to the remaining synchronous standby yet. What if > the master goes down at this point? How can we determine whether > promoting the remaining standby to the master causes data loss? Yep. That issue has been raised before, and I think it's quite valid. That's not to say the feature isn't valid, but I think trying to include it in the first commit is going to lead to endless wrangling about design. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: > On Fri, Sep 17, 2010 at 8:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > The only thing standby registration allows you to do is know whether > > there was supposed to be a standby there, but yet it isn't there now. I > > don't see that point as being important because it seems strange to me > > to want to wait for a standby that ought to be there, but isn't anymore. > > According to what I heard, some people want to guarantee that all the > transactions are *always* written in *all* the synchronous standbys. > IOW, they want to keep the transaction waiting until it has been written > in all the synchronous standbys. Standby registration is required to > support such a use case. Without the registration, the master cannot > determine whether the transaction has been written in all the synchronous > standbys. You don't need standby registration at all. You can do that with a single parameter, already proposed: quorum_commit = N. But most people said they didn't want it. If they do we can put it back later. I don't think we're getting anywhere here. I just don't see any *need* to have it. Some people might *want* to set things up that way, and if that's true, that's enough for me to agree with them. The trouble is, I know some people have said they *want* to set it in the standby and we definitely *need* to set it somewhere. After this discussion, I think "both" is easily done and quite cool. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2010-09-17 at 20:56 +0900, Fujii Masao wrote: > On Fri, Sep 17, 2010 at 7:41 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > >> The question is do we want standby registration on master and if so, > >> why? > > > > Well, aside from how to configure synchronous replication, standby > > registration would help with retaining the right amount of WAL in the > > master. wal_keep_segments doesn't guarantee that enough is retained, and > > OTOH when all standbys are connected you retain much more than might be > > required. > > Yep. Setting wal_keep_segments is difficult, but its not a tunable. The sysadmin needs to tell us what is the maximum number of files she'd like to keep. Otherwise we may fill up a disk, use space intended for use by another app, etc.. The server cannot determine what limits the sysadmin may wish to impose. The only sane default is 0, because "store everything, forever" makes no sense. Similarly, if we register a server, it goes down and we forget to deregister it then we will attempt to store everything, forever and our system will go down. The bigger problem is base backups, not server restarts. We don't know how to get that right because we don't register base backups automatically. If we did dynamically alter the number of WALs we store then we'd potentially screw up new base backups. Server registration won't help with that at all, so you'd need to add a base backup registration scheme as well. But even if you had that, you'd still need a "max" setting defined by sysadmin. So the only sane thing to do is to set wal_keep_segments as high as possible. And doing that doesn't need server reg. > And standby registration is required when we support "wait forever when > synchronous standby isn't connected at the moment" option that Heikki > explained upthread. Though I don't think that standby registration is > required in the first phase since "wait forever" option is not used in > basic use case. Synchronous replication is basically used to reduce the > downtime, and "wait forever" option opposes that. Agreed, but I'd say "if" we support that. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
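For reference, the cap Simon is describing is the existing GUC on the master; a minimal example (the number is arbitrary):

    # postgresql.conf on the master
    # With the default 16 MB segment size this retains roughly 4 GB of WAL
    # for disconnected standbys before segments are recycled.
    wal_keep_segments = 256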
On Fri, 2010-09-17 at 13:41 +0300, Heikki Linnakangas wrote: > On 17/09/10 12:49, Simon Riggs wrote: > > Fujii has long talked about 4 levels of service also. Why change? I had > > thought that part was pretty much agreed between all of us. > > Now you lost me. I agree that we need 4 levels of service (at least > ultimately, not necessarily in the first phase). OK, good. > > Without performance tests to demonstrate "why", these do sound hard to > > understand. But we should note that DRBD offers recv ("B") and fsync > > ("C") as separate options. And Oracle implements all 3 of recv, fsync > > and apply. Neither of them describe those options so simply and easily > > as the way we are proposing with a 4 valued enum (with async as the > > fourth option). > > > > If we have only one option for sync_rep = 'on' which of recv | fsync | > > apply would it implement? You don't mention that. Which do you choose? > > You would choose between recv, fsync and apply in the slave, with a GUC. So you would have both registration on the master and parameter settings on the standby? I doubt you mean that, so possibly need more explanation there for me to understand what you mean and also why you would do that. > > I no longer seek to persuade by words alone. The existence of my patch > > means that I think that only measurements and tests will show why I have > > been saying these things. We need performance tests. > > I don't expect any meaningful differences in terms of performance > between any of the discussed options. The big question right now is... This is the critical point. Politely, I would observe that *You* do not think there is a meaningful difference. *I* do, and evidence suggests that both Oracle and DRBD think so too. So we differ on what the "big question" is here. It's sounding to me that if we don't know these things, then we're quite a long way from committing something. This is basic research. > what > features we provide and how they're configured. Performance will depend > primarily on the mode you use, and secondarily on the implementation of > the mode. It would be completely premature to do performance testing yet > IMHO. If a patch is "ready" then we should be able to performance test it *before* we commit it. From what you say it sounds like Fujii's patch might yet require substantial tuning, so it might even be the case that my patch is closer in terms of readiness to commit. Whatever the case, we have two patches and I can't see any benefit in avoiding performance tests. > >> Putting all of that together. I think Fujii-san's standby.conf is pretty > >> close. > > > >> What it needs is the additional GUC for transaction-level control. > > > > The difference between the patches is not a simple matter of a GUC. > > > > My proposal allows a single standby to provide efficient replies to > > multiple requested durability levels all at the same time. With > > efficient use of network resources. ISTM that because the other patch > > cannot provide that you'd like to persuade us that we don't need that, > > ever. You won't sell me on that point, cos I can see lots of uses for > > it. > > Simon, how the replies are sent is an implementation detail I haven't > given much thought yet. It seems clear we've thought about different details around these topics. Now I understand your work on latches, I see it is an important contribution and I very much respect that. IMHO, each of us has seen something important that the other has not. 
> The reason we delved into that discussion > earlier was that you seemed to contradict yourself with the claims that > you don't need to send more than one reply per transaction, and that the > standby doesn't need to know the synchronization level. Other than that > the curiosity about that contradiction, it doesn't seem like a very > interesting detail to me right now. It's not a question that drives the > rest of the design, but the other way round. There was no contradiction. You just didn't understand how it could be possible, so dismissed it. It's a detail, yes. Some are critical, some are not. (e.g. latches.) My view is that it is critical and drives the design. So I don't agree with you on "the other way around". > But FWIW, something like your proposal of sending 3 XLogRecPtrs in each > reply seems like a good approach. I'm not sure about using walwriter. I > can see that it helps with getting the 'recv' and 'replay' > acknowledgments out faster, but > I still have the scars from starting > bgwriter during recovery. I am happy to apologise for those problems. I was concentrating on HS at the time, not on that aspect. You sorted out those problems for me and I thank you for that. With that in mind, I will remove the aspect of my patch that relate to starting wal writer. Small amount of code only. That means we will effectively disable recv mode for now, but I definitely want to be able to put it back later. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2010-09-17 at 21:43 +0900, Fujii Masao wrote: > On Fri, Sep 17, 2010 at 5:09 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers > > can be seen as important special cases of this. > > I think that we should skip quorum commit at the first phase > because the design seems to be still poorly-thought-out. Agreed > I'm concerned about the case where the faster synchronous standby > goes down and the lagged synchronous one remains when n=1. In this > case, some transactions marked as committed in a client might not > be replicated to the remaining synchronous standby yet. What if > the master goes down at this point? How can we determine whether > promoting the remaining standby to the master causes data loss? In that config if the faster sync standby goes down then your application performance goes down dramatically. That would be fragile. So you would set up like this master - requests are > async standby1 - fast - so use recv | fsync | apply standby2 - async So if standby1 goes down we don't wait for standby2, but we do continue to stream to it. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On 17/09/10 15:56, Simon Riggs wrote: > On Fri, 2010-09-17 at 13:41 +0300, Heikki Linnakangas wrote: >> On 17/09/10 12:49, Simon Riggs wrote: >>> Without performance tests to demonstrate "why", these do sound hard to >>> understand. But we should note that DRBD offers recv ("B") and fsync >>> ("C") as separate options. And Oracle implements all 3 of recv, fsync >>> and apply. Neither of them describe those options so simply and easily >>> as the way we are proposing with a 4 valued enum (with async as the >>> fourth option). >>> >>> If we have only one option for sync_rep = 'on' which of recv | fsync | >>> apply would it implement? You don't mention that. Which do you choose? >> >> You would choose between recv, fsync and apply in the slave, with a GUC. > > So you would have both registration on the master and parameter settings > on the standby? I doubt you mean that, so possibly need more explanation > there for me to understand what you mean and also why you would do that. Yes, that's what I meant. No-one else seems to think that's a good idea :-). >> I don't expect any meaningful differences in terms of performance >> between any of the discussed options. The big question right now is... > > This is the critical point. Politely, I would observe that *You* do not > think there is a meaningful difference. *I* do, and evidence suggests > that both Oracle and DRBD think so too. So we differ on what the "big > question" is here. We must be talking about different things again. There's certainly big differences in the different synchronization levels and configurations, but I don't expect there to be big performance differences between patches to implement those levels. Once we got rid of the polling loops, I expect the network and disk latencies to dominate. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
* Robert Haas <robertmhaas@gmail.com> [100917 07:44]: > On Fri, Sep 17, 2010 at 7:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > The only thing standby registration allows you to do is know whether > > there was supposed to be a standby there, but yet it isn't there now. I > > don't see that point as being important because it seems strange to me > > to want to wait for a standby that ought to be there, but isn't anymore. > > What happens if it never comes back? Manual intervention required. > > The absence of registration in my patch makes some things easier and > > some things harder. For example, you can add a new standby without > > editing the config on the master. > > That's actually one of the reasons why I like the idea of > registration. It seems rather scary to add a new standby without > editing the config on the master. Actually, adding a new fully-async > slave without touching the master seems reasonable, but adding a new > sync slave without touching the master gives me the willies. The > behavior of the system could change quite sharply when you do this, > and it might not be obvious what has happened. (Imagine DBA #1 makes > the change and DBA #2 is then trying to figure out what's happened - > he checks the configs of all the machines he knows about and finds > them all unchanged... head-scratching ensues.) So, those both give me the willies too... I've had a rack lose all power. Now, let's say I've got two servers (plus trays of disks for each) in the same rack. Ya, I know, I should move them to separate racks, preferably in separate buildings on the same campus, but realistically... I want to have them configured with fsync-style WAL sync rep, and I want to make sure that if the master comes up first after I get power back, it's not going to be claiming transactions are committed while the slave (which happens to have 4x the disks because it keeps PITR backups for a period too) is still chugging away on SCSI probes, not having gotten PostgreSQL up yet... And I want to make sure the dev box that some other DBA is testing another slave setup on, in some test area but not in the same rack, *can't* through some mis-configuration make my master think that its production slave has properly fsync'ed the replicated WAL. </hopes & dreams> -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
* Fujii Masao <masao.fujii@gmail.com> [100917 07:57]: > Synchronous replication is basically used to reduce the > downtime, and "wait forever" option opposes that. Hm... I'm not sure I'ld agree with that. I'ld rather have some downtime, and my data available, then have less downtime, but find that I'm missing valuable data that was committed, but happend to not be replicated because no slave was available "yet". Sync rep is about "data availability", "data recoverability", *and* "downtime". The three are definitely related, but each use has their own tradeoffs. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote: > I want to have them configured in a fsync WAL/style sync rep, I want to > make sure that if the master comes up first after I get power back, it's > not going to be claiming transactions are committed while the slave > (which happens to have 4x the disks because it keeps PITR backups for a > period too) it still chugging away on SCSI probes yet, not gotten to > having PostgreSQL up yet... Nobody has mentioned the ability to persist the not-committed state across a crash before, and I think it's an important discussion point. We already have it: its called "two phase commit". (2PC) If you run 2PC on 3 servers and one goes down, you can just commit the in-flight transactions and continue. But it doesn't work on hot standby. It could: If we want that we could prepare the transaction on the master and don't allow commit until we get positive confirmation from standby. All of the machinery is there. I'm not sure if that's a 5th sync rep mode, or that idea is actually good enough to replace all the ideas we've had up until now. I would say probably not, but we should think about this. A slightly modified idea would be avoid writing the transaction prepare file as a separate file, just write the WAL for the prepare. We then remember the LSN of the prepare so we can re-access the WAL copy of it by re-reading the WAL files on master. Make sure we don't get rid of WAL that refers to waiting transactions. That would then give us the option to commit or abort depending upon whether we receive a reply within timeout. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
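For readers who have not used it, the existing two-phase commit machinery Simon refers to looks like this at the SQL level (it requires max_prepared_transactions > 0; the table and transaction identifier are made up for the example). In the idea sketched above, the server would effectively hold the transaction at the prepared stage until the standby confirms, rather than the client doing it by hand:

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- example work
    PREPARE TRANSACTION 'sync_rep_demo';  -- durable on the master, survives a crash, not yet visible as committed
    -- ... positive confirmation arrives from the standby ...
    COMMIT PREPARED 'sync_rep_demo';      -- or ROLLBACK PREPARED 'sync_rep_demo' if it never arrives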
On Fri, Sep 17, 2010 at 11:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote: > >> I want to have them configured in a fsync WAL/style sync rep, I want to >> make sure that if the master comes up first after I get power back, it's >> not going to be claiming transactions are committed while the slave >> (which happens to have 4x the disks because it keeps PITR backups for a >> period too) it still chugging away on SCSI probes yet, not gotten to >> having PostgreSQL up yet... > > Nobody has mentioned the ability to persist the not-committed state > across a crash before, and I think it's an important discussion point. Eh? I think all Aidan is asking for is the ability to have a mode where sync rep is really always sync, or nothing commits. Rather than timing out and continuing merrily on its way... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
* Robert Haas <robertmhaas@gmail.com> [100917 11:24]: > On Fri, Sep 17, 2010 at 11:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote: > > > >> I want to have them configured in a fsync WAL/style sync rep, I want to > >> make sure that if the master comes up first after I get power back, it's > >> not going to be claiming transactions are committed while the slave > >> (which happens to have 4x the disks because it keeps PITR backups for a > >> period too) it still chugging away on SCSI probes yet, not gotten to > >> having PostgreSQL up yet... > > > > Nobody has mentioned the ability to persist the not-committed state > > across a crash before, and I think it's an important discussion point. > > Eh? I think all Aidan is asking for is the ability to have a mode > where sync rep is really always sync, or nothing commits. Rather than > timing out and continuing merrily on its way... Right, I'm not asking for a "new" mode. I'm just hope that there will be a way to guarantee my "sync rep" is actually replicating. Having it "not replicate" simply because no slave has (yet) connected means I have to dance jigs around pg_hba.conf so that it won't allow non-replication connections until I've manual verified that the replication slave is connected... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, 2010-09-17 at 11:30 -0400, Aidan Van Dyk wrote: > * Robert Haas <robertmhaas@gmail.com> [100917 11:24]: > > On Fri, Sep 17, 2010 at 11:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote: > > > > > >> I want to have them configured in a fsync WAL/style sync rep, I want to > > >> make sure that if the master comes up first after I get power back, it's > > >> not going to be claiming transactions are committed while the slave > > >> (which happens to have 4x the disks because it keeps PITR backups for a > > >> period too) it still chugging away on SCSI probes yet, not gotten to > > >> having PostgreSQL up yet... > > > > > > Nobody has mentioned the ability to persist the not-committed state > > > across a crash before, and I think it's an important discussion point. > > > > Eh? I think all Aidan is asking for is the ability to have a mode > > where sync rep is really always sync, or nothing commits. Rather than > > timing out and continuing merrily on its way... > > Right, I'm not asking for a "new" mode. I'm just hope that there will > be a way to guarantee my "sync rep" is actually replicating. Having it > "not replicate" simply because no slave has (yet) connected means I have > to dance jigs around pg_hba.conf so that it won't allow non-replication > connections until I've manual verified that the replication slave > is connected... I agree that aspect is a problem. One solution, to me, would be to have a directive included in the pg_hba.conf that says entries below it are only allowed if it passes the test. So your hba file looks like this local postgres postgres host replication ... need replication host any any So the "need" test is an extra option in the first column. We might want additional "need" tests before we allow other rules also. Text following the "need" verb will be additional info for that test, sufficient to allow some kind of execution on the backend. I definitely don't like the idea that anyone that commits will just sit there waiting until the standby comes up. That just sounds an insane way of doing it. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
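Laid out one entry per line, the pg_hba.conf sketch in the message above would look roughly like this; the "need" verb is proposed syntax only, it does not exist today, and the elided columns are left as Simon wrote them:

    local  postgres     postgres
    host   replication  ...
    need   replication
    host   any          any

Entries below the "need replication" line would only be allowed once that test passes, i.e. once a replication connection is present.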
On Fri, 2010-09-17 at 16:09 +0300, Heikki Linnakangas wrote: > >> I don't expect any meaningful differences in terms of performance > >> between any of the discussed options. The big question right now is... > > > > This is the critical point. Politely, I would observe that *You* do not > > think there is a meaningful difference. *I* do, and evidence suggests > > that both Oracle and DRBD think so too. So we differ on what the "big > > question" is here. > > We must be talking about different things again. There's certainly big > differences in the different synchronization levels and configurations, > but I don't expect there to be big performance differences between > patches to implement those levels. Once we got rid of the polling loops, > I expect the network and disk latencies to dominate. So IIUC you seem to agree with * 4 levels of synchronous replication (specified on master) * transaction-controlled replication from the master * sending 3 LSN values back from standby Well, then that pretty much is my patch, except for the parameter UI. Did I misunderstand? We also agree that we need a standby to master protocol change; I used Zoltan's directly and I've had zero problems with it in testing. The only disagreement has been about * the need for standby registration (I understand "want") which seems to boil down to whether we wait for servers that *ought* to be there, but currently aren't. * whether to have wal writer active (I'm happy to add that later in this release, so we get the "recv" option also) * whether we have a parameter for quorum commit > 1 (happy to add later) Not sure if there is debate about whether quorum_commit = 1 is the default. * whether we provide replication_exceptions as core feature or as a plugin The only area of doubt is when we send replies, which you haven't thought about yet. So presumably you've no design-level objection to what I've proposed. Things we all seem to like are * different standbys can offer different sync levels * standby names * a set returning function which tells you current LSNs of all standbys * the rough idea of being able to specify a "service" and have that equate to a more complex config underneath the covers, without needing to have the application know the details - I think we need more details on that before we could say "we agree". So seems like a good days work. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On 2010-09-17 10:09, Heikki Linnakangas wrote: > I think it makes most sense to set sync vs. async in the master, and > the level of synchronicity in the slave. Although I have sympathy > for the argument that it's simpler if you configure it all from the > master side as well. Just a comment as a sysadmin: it would be hugely beneficial if the master and slaves were all able to run from the exact same configuration file. This would leave out any doubt about the configuration of the "complete cluster" in terms of debugging. A slave would be able to just copy over the master's configuration, etc. I don't know if it is doable or has any huge downsides. -- Jesper
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: >> According to what I heard, some people want to guarantee that all the >> transactions are *always* written in *all* the synchronous standbys. > > You don't need standby registration at all. You can do that with a > single parameter, already proposed: > > quorum_commit = N. I think you also need another parameter to control the behavior upon timeout. You received less than N votes, now what? Your current idea seems to be COMMIT, Aidan says ROLLBACK, and I say that should be a GUC set at the transaction level. As far as registration goes, I see no harm in having the master maintain a list of known standby systems; it's just maintaining that list from the master that I don't understand the use case for. Regards, -- dim
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: >> What synchronization level does each combination of sync_replication >> and sync_replication_service lead to? > > There are only 4 possible outcomes. There is no combination, so we don't > need a table like that above. > > The "service" specifies the highest request type available from that > specific standby. If someone requests a higher service than is currently > offered by this standby, they will either > a) get that service from another standby that does offer that level > b) automatically downgrade the sync rep mode to the highest available. I like the a) part, but I can't say the same about the b) part. There's no reason to go ahead and COMMIT a transaction when the requested durability is known not to have been reached, unless the user said so. > For example, if you request recv but there is only one standby and it > only offers async, then you get downgraded to async. If you so choose, but with a net slowdown, as you're now reaching the timeout for each transaction, with what I have in mind, and I don't see how you can avoid that. Even if you set up the replication from the master, you still can mess it up the same way, right? Regards, -- dim
On Fri, 2010-09-17 at 21:32 +0200, Dimitri Fontaine wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: > >> According to what I heard, some people want to guarantee that all the > >> transactions are *always* written in *all* the synchronous standbys. > > > > You don't need standby registration at all. You can do that with a > > single parameter, already proposed: > > > > quorum_commit = N. > > I think you also need another parameter to control the behavior upon > timeout. You received less than N votes, now what? You're current idea > seems to be COMMIT, Aidan says ROLLBACK, and I say that's to be a GUC > set at the transaction level. I've said COMMIT with no option because I believe that we have only two choices: commit or wait (perhaps forever), and IMHO waiting is not good. We can't ABORT, because we sent a commit to the standby. If we abort, then we're saying the standby can't ever come back because it will have received and potentially replayed a different transaction history. I had some further thoughts around that but you end up with the byzantine generals problem always. Waiting might sound attractive. In practice, waiting will make all of your connections lock up and it will look to users as if their master has stopped working as well. (It has!). I can't imagine why anyone would ever want an option to select that; its the opposite of high availability. Just sounds like a serious footgun. Having said that Oracle offers Maximum Protection mode, which literally shuts down the master when you lose a standby. I can't say anything apart from "LOL". > As far as registration goes, I see no harm to have the master maintain a > list of known standby systems, of course, it's just maintaining that > list from the master that I don't understand the use case for. Yes, the master needs to know about all currently connected standbys. The only debate is what happens about ones that "ought" to be there. Given my comments above, I don't see the need. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes: > I've said COMMIT with no option because I believe that we have only two > choices: commit or wait (perhaps forever), and IMHO waiting is not good. > > We can't ABORT, because we sent a commit to the standby. Ah yes, I keep forgetting Sync Rep is not about 2PC. Sorry about that. > Waiting might sound attractive. In practice, waiting will make all of > your connections lock up and it will look to users as if their master > has stopped working as well. (It has!). I can't imagine why anyone would > ever want an option to select that; its the opposite of high > availability. Just sounds like a serious footgun. I guess that if there's a timeout GUC it can still be set to infinite somehow. Unclear as the use case might be. Regards, -- dim
On Sat, Sep 18, 2010 at 4:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Waiting might sound attractive. In practice, waiting will make all of > your connections lock up and it will look to users as if their master > has stopped working as well. (It has!). I can't imagine why anyone would > ever want an option to select that; its the opposite of high > availability. Just sounds like a serious footgun. Nevertheless, it seems that some people do want exactly that behavior, no matter how crazy it may seem to you. I'm not exactly sure what we're in disagreement about, TBH. You've previously said that you don't think standby registration is necessary, but that you don't object to it if others want it. So it seems like this might be mostly academic. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
All, I'm answering this strictly from the perspective of my company's customers and what they've asked for. It does not reflect on what features are reflected in whatever patch. > * Support multiple standbys with various synchronization levels. Essential. We already have two customers who want to have one synch and several async standbys. > * What happens if a synchronous standby isn't connected at the moment? > Return immediately vs. wait forever. Essential. Actually, we need a replication_return_timeout. That is, wait X seconds on the standby and then give up. Again, in the systems I'm working with, we'd want to wait 5 seconds and then abort replication. > * Per-transaction control. Some transactions are important, others are not. Low priority. I see this as a 9.2 feature. Nobody I know is asking for it yet, and I think we need to get the other stuff right first. > * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all > servers can be seen as important special cases of this. Medium priority. This would go together with having a registry of standbies. The only reason I don't call this low priority is that it would catapult PostgreSQL into the realm of CAP databases, assuming that we could deal with the re-mastering issue as well. > * async, recv, fsync and replay levels of synchronization. Fsync vs. Replay is low priority (as in, we could live with just one or the other), but the others are all high priority. Again, this should be settable *per standby*. > So what should the user interface be like? Given the 1st and 2nd > requirement, we need standby registration. If some standbys are > important and others are not, the master needs to distinguish between > them to be able to determine that a transaction is safely delivered to > the important standbys. There are considerable benefits to having a standby registry with a table-like interface. Particularly, one where we could change replication via UPDATE (or ALTER STANDBY) statements. a) we could eliminate a bunch of GUCs and control standby behavior instead via the table interface. b) DBAs and monitoring tools could see at a glance what the status of their replication network was. c) we could easily add new features (like quorum groups) without breaking prior setups. d) it would become easy rather than a PITA to construct GUI replication management tools. e) as previously mentioned, we could use it to have far more intelligent control over what WAL segments to keep, both on the master and in some distributed archive. Note, however, that the data from this pseudo-table would need to be replicated to the standby servers somehow in order to support re-mastering. Take all the above with a grain of salt, though. The important thing is to get *some kind* of synch rep into 9.1, and get 9.1 out on time. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
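A sketch of what the master-side settings implied by this wish list might look like; both names are taken from the thread (quorum_commit from Simon, replication_return_timeout from Josh's description above), neither exists in any patch in this form, so treat this purely as an illustration:

    # postgresql.conf on the master
    quorum_commit = 1                     # how many standbys must acknowledge
    replication_return_timeout = '5s'     # give up waiting on a standby after 5 seconds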
On Sat, Sep 18, 2010 at 5:42 PM, Josh Berkus <josh@agliodbs.com> wrote: > There are considerable benefits to having a standby registry with a > table-like interface. Particularly, one where we could change > replication via UPDATE (or ALTER STANDBY) statements. I think that using a system catalog for this is going to be a non-starter, but we could use a flat file that is designed to be machine-editable (and thus avoid repeating the mistake we've made with postgresql.conf). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
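Purely as an illustration of "machine-editable flat file", something along these lines; the format and field names are invented here (no patch in this thread defines one), and the standby names are borrowed from Heikki's pg_slave_status example:

    # standbys.conf -- one standby per line: name  sync_level  wait_forever
    reporting    recv    off
    ha-standby   apply   on
    testserver   async   off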
> I think that using a system catalog for this is going to be a > non-starter, Technically improbable? Darn. > but we could use a flat file that is designed to be > machine-editable (and thus avoid repeating the mistake we've made with > postgresql.conf). Well, even if we can't update it through the command line, at least the existing configuration (and node status) ought to be queryable. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Sat, 2010-09-18 at 14:42 -0700, Josh Berkus wrote: > > * Per-transaction control. Some transactions are important, others are not. > > Low priority. > I see this as a 9.2 feature. Nobody I know is asking for it yet, and I > think we need to get the other stuff right first. I understand completely why anybody that has never used sync replication would think per-transaction control is a small deal. I fully expect your clients to try sync rep and then 5 minutes later say "Oh Crap, this sync rep is so slow it's unusable. Isn't there a way to tune it?". I've designed a way to tune sync rep so it is usable and useful. And putting that feature into 9.1 costs very little, if anything. My patch to do this is actually smaller than any other attempt to implement this and I claim faster too. You don't need to use the per-transaction controls, but they'll be there if you need them. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
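As an illustration of what per-transaction control means in practice, a sketch using SET LOCAL with the GUC name from Simon's earlier messages (the exact spelling in the patch may differ, and the tables are made up):

    BEGIN;
    SET LOCAL sync_replication = 'apply';                    -- important: wait until applied on a standby
    UPDATE orders SET status = 'paid' WHERE order_id = 42;
    COMMIT;

    BEGIN;
    SET LOCAL sync_replication = 'async';                    -- bulk/unimportant work: don't wait
    INSERT INTO audit_log (msg) VALUES ('nightly batch');
    COMMIT;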
> I've designed a way to tune sync rep so it is usable and useful. And > putting that feature into 9.1 costs very little, if anything. My patch > to do this is actually smaller than any other attempt to implement this > and I claim faster too. You don't need to use the per-transaction > controls, but they'll be there if you need them. Well, if you already have the code, that's a different story ... -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 18/09/10 22:59, Robert Haas wrote: > On Sat, Sep 18, 2010 at 4:50 AM, Simon Riggs<simon@2ndquadrant.com> wrote: >> Waiting might sound attractive. In practice, waiting will make all of >> your connections lock up and it will look to users as if their master >> has stopped working as well. (It has!). I can't imagine why anyone would >> ever want an option to select that; its the opposite of high >> availability. Just sounds like a serious footgun. > > Nevertheless, it seems that some people do want exactly that behavior, > no matter how crazy it may seem to you. Yeah, I agree with both of you. I have a hard time imagining a situation where you would actually want that. It's not high availability, it's high durability. When a transaction is acknowledged as committed, you know it's never ever going to disappear even if a meteor strikes the current master server within the next 10 milliseconds. In practice, people want high availability instead. That said, the timeout option also feels a bit wishy-washy to me. With a timeout, acknowledgment of a commit means "your transaction is safely committed in the master and slave. Or not, if there was some glitch with the slave". That doesn't seem like a very useful guarantee; if you're happy with that, why not just use async replication? However, the "wait forever" behavior becomes useful if you have a monitoring application outside the DB that decides when enough is enough and tells the DB that the slave can be considered dead. So "wait forever" actually means "wait until I tell you that you can give up". The monitoring application can STONITH to ensure that the slave stays down, before letting the master proceed with the commit. With that in mind, we have to make sure that a transaction that's waiting for acknowledgment of the commit from a slave is woken up if the configuration changes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 19/09/10 01:20, Robert Haas wrote: > On Sat, Sep 18, 2010 at 5:42 PM, Josh Berkus<josh@agliodbs.com> wrote: >> There are considerable benefits to having a standby registry with a >> table-like interface. Particularly, one where we could change >> replication via UPDATE (or ALTER STANDBY) statements. > > I think that using a system catalog for this is going to be a > non-starter, but we could use a flat file that is designed to be > machine-editable (and thus avoid repeating the mistake we've made with > postgresql.conf). Yeah, that needs some careful design. We also need to record transient information about each slave, like how far it has received WAL already. Ideally that information would survive database restart too, but maybe we can live without that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi, On 09/17/2010 01:56 PM, Fujii Masao wrote: > And standby registration is required when we support "wait forever when > synchronous standby isn't connected at the moment" option that Heikki > explained upthread. That requirement can be reduced to saying that the master only needs to know how many synchronous standbys *should* be connected. IIUC that's pretty much exactly the quorum_commit GUC that Simon proposed, because it doesn't make sense to have more synchronous standbys connected than quorum_commit (as Simon pointed out downthread). I'm unsure about what's better, the full list (giving a good overview, but more to configure) or the single sum GUC (being very flexible and closer to how things work internally). But that seems to be a UI question exclusively. Regarding the "wait forever" option: I don't think continuing is a viable alternative, as it silently ignores the requested level of persistence. The only alternative I can see is to abort with an error. As far as comparison is allowed, that's what Postgres-R currently does if there's no majority of nodes. It allows emitting an error message and helpful hints, as opposed to letting the admin figure out what and where it's hanging. Not throwing false errors has the same requirements as "waiting forever", so that's an orthogonal issue, IMO. Regards Markus Wanner
On Mon, 2010-09-20 at 09:27 +0300, Heikki Linnakangas wrote: > On 18/09/10 22:59, Robert Haas wrote: > > On Sat, Sep 18, 2010 at 4:50 AM, Simon Riggs<simon@2ndquadrant.com> wrote: > >> Waiting might sound attractive. In practice, waiting will make all of > >> your connections lock up and it will look to users as if their master > >> has stopped working as well. (It has!). I can't imagine why anyone would > >> ever want an option to select that; its the opposite of high > >> availability. Just sounds like a serious footgun. > > > > Nevertheless, it seems that some people do want exactly that behavior, > > no matter how crazy it may seem to you. > > Yeah, I agree with both of you. I have a hard time imaging a situation > where you would actually want that. It's not high availability, it's > high durability. When a transaction is acknowledged as committed, you > know it's never ever going to disappear even if a meteor strikes the > current master server within the next 10 milliseconds. In practice, > people want high availability instead. > > That said, the timeout option also feels a bit wishy-washy to me. With a > timeout, acknowledgment of a commit means "your transaction is safely > committed in the master and slave. Or not, if there was some glitch with > the slave". That doesn't seem like a very useful guarantee; if you're > happy with that why not just use async replication? > > However, the "wait forever" behavior becomes useful if you have a > monitoring application outside the DB that decides when enough is enough > and tells the DB that the slave can be considered dead. So "wait > forever" actually means "wait until I tell you that you can give up". > The monitoring application can STONITH to ensure that the slave stays > down, before letting the master proceed with the commit. err... what is the difference between a timeout and stonith? None. We still proceed without the slave in both cases after the decision point. In all cases, we would clearly have a user accessible function to stop particular sessions, or all sessions, from waiting for standby to return. You would have 3 choices: * set automatic timeout * set wait forever and then wait for manual resolution * set wait forever and then trust to external clusterware Many people have asked for timeouts and I agree it's probably the easiest thing to do if you just have 1 standby. > With that in mind, we have to make sure that a transaction that's > waiting for acknowledgment of the commit from a slave is woken up if the > configuration changes. There's a misunderstanding here of what I've said and its a subtle one. My patch supports a timeout of 0, i.e. wait forever. Which means I agree that functionality is desired and should be included. This operates by saying that if a currently-connected-standby goes down we will wait until the timeout. So I agree all 3 choices should be available to users. Discussion has been about what happens to ought-to-have-been-connected standbys. Heikki had argued we need standby registration because if a server *ought* to have been there, yet isn't currently there when we wait for sync rep, we would still wait forever for it to return. To do this you require standby registration. But there is a hidden issue there: If you care about high availability AND sync rep you have two standbys. If one goes down, the other is still there. In general, if you want high availability on N servers then you have N+1 standbys. 
If one goes down, the other standbys provide the required level of durability and we do not wait. So the only case where standby registration is required is where you deliberately choose to *not* have N+1 redundancy and yet still require all N standbys to acknowledge. That is a suicidal config and nobody would sanely choose that. It's not a large or useful use case for standby reg. (But it does raise the question again of whether we need quorum commit). My take is that if the above use case occurs it is because one standby has just gone down and the standby is, for a hopefully short period, in a degraded state, and the service responds to that. So in my proposal, if a standby is not there *now* we don't wait for it. Which cuts out a huge bag of code, specification and suchlike that isn't required to support sane use cases. More stuff to get wrong and regret in later releases. The KISS principle, just like we apply in all other cases. If we did have standby registration, then I would implement it in a table, not in an external config file. That way when we performed a failover the data would be accessible on the new master. But I don't suggest we have CREATE/ALTER STANDBY syntax. We already have CREATE/ALTER SERVER if we wanted to do it in SQL. If we did that, ISTM we should choose functions. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
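To make Simon's three choices concrete: with a single timeout parameter of the kind his patch describes (where 0 means wait forever), the settings might look roughly like the sketch below. The parameter name is invented for illustration; only the 0-means-wait-forever behaviour is taken from the discussion above.

-------
sync_rep_timeout = 30s   # choice 1: give up waiting on a standby after 30 seconds
sync_rep_timeout = 0     # choices 2 and 3: wait forever, until an operator
                         # (or external clusterware) releases the waiting sessions
-------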
On 20/09/10 12:17, Simon Riggs wrote: > err... what is the difference between a timeout and stonith? STONITH ("Shoot The Other Node In The Head") means that the other node is somehow disabled so that it won't unexpectedly come back alive. A timeout means that the slave hasn't been seen for a while, but it might reconnect just after the timeout has expired. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2010-09-20 at 15:16 +0300, Heikki Linnakangas wrote: > On 20/09/10 12:17, Simon Riggs wrote: > > err... what is the difference between a timeout and stonith? > > STONITH ("Shoot The Other Node In The Head") means that the other node > is somehow disabled so that it won't unexpectedly come back alive. A > timeout means that the slave hasn't been seen for a while, but it might > reconnect just after the timeout has expired. You've edited my reply to change the meaning of what was a rhetorical question, as well as completely ignoring the main point of my reply. Please respond to the main point: Following some thought and analysis, AFAICS there is no sensible use case that requires standby registration. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 20/09/10 15:50, Simon Riggs wrote: > On Mon, 2010-09-20 at 15:16 +0300, Heikki Linnakangas wrote: >> On 20/09/10 12:17, Simon Riggs wrote: >>> err... what is the difference between a timeout and stonith? >> >> STONITH ("Shoot The Other Node In The Head") means that the other node >> is somehow disabled so that it won't unexpectedly come back alive. A >> timeout means that the slave hasn't been seen for a while, but it might >> reconnect just after the timeout has expired. > > You've edited my reply to change the meaning of what was a rhetorical > question, as well as completely ignoring the main point of my reply. > > Please respond to the main point: Following some thought and analysis, > AFAICS there is no sensible use case that requires standby registration. Ok, I had completely missed your point then. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Sep 20, 2010 at 8:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Please respond to the main point: Following some thought and analysis, > AFAICS there is no sensible use case that requires standby registration. I disagree. You keep analyzing away the cases that require standby registration, but I don't believe that they're not real. Aidan Van Dyk's case upthread of wanting to make sure that the standby is up and replicating synchronously before the master starts processing transactions seems perfectly legitimate to me. Sure, it's paranoid, but so what? We're all about paranoia, at least as far as data loss is concerned. So the "wait forever" case is, in my opinion, sufficient to demonstrate that we need it, but it's not even my primary reason for wanting to have it. The most important reason why I think we should have standby registration is for simplicity of configuration. Yes, it adds another configuration file, but that configuration file contains ALL of the information about which standbys are synchronous. Without standby registration, this information will inevitably be split between the master config and the various slave configs and you'll have to look at all the configurations to be certain you understand how it's going to end up working. As a particular manifestation of this, and as previously argued and +1'd upthread, the ability to change the set of standbys to which the master is replicating synchronously without changing the configuration on the master or any of the existing slaves seems dangerous. Another reason why I think we should have standby registration is to eventually allow the "streaming WAL backwards" configuration which has previously been discussed. IOW, you could stream the WAL to the slave in advance of fsync-ing it on the master. After a power failure, the machines in the cluster can talk to each other and figure out which one has the furthest-advanced WAL pointer and stream from that machine to all the others. This is an appealing configuration for people using sync rep because it would allow the fsyncs to be done in parallel rather than sequentially as is currently necessary - but if you're using it, you're certainly not going to want the master to enter normal running without waiting to hear from the slave. Just to be clear, that is a list of three independent reasons any one of which I think is sufficient for wanting standby registration. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, I'm somewhat sorry to have to play this game, as I sure don't feel smarter by composing this email. Quite the contrary. Robert Haas <robertmhaas@gmail.com> writes: > So the "wait forever" case is, in my opinion, > sufficient to demonstrate that we need it, but it's not even my > primary reason for wanting to have it. You're talking about standby registration on the master. You can solve this case without it, because when a slave is not connected it's not giving any feedback (vote, weight, ack) to the master. All you have to do is have the quorum setup in a way that disconnecting your slave means you can't reach the quorum any more. Have it SIGHUP and you can even choose to fix the setup, rather than fix the standby. So no need for registration here, it's just another way to solve the problem. Not saying it's better or worse, just another. Now we could have a summary function on the master showing all the known slaves, their last time of activity, their known current setup, etc, all from the master, but read-only. Would that be useful enough? > The most important reason why I think we should have standby > registration is for simplicity of configuration. Yes, it adds another > configuration file, but that configuration file contains ALL of the > information about which standbys are synchronous. Without standby > registration, this information will inevitably be split between the > master config and the various slave configs and you'll have to look at > all the configurations to be certain you understand how it's going to > end up working. So, here, we have two quite different things to be concerned about. First is the configuration, and I say that managing a distributed setup will be easier for the DBA. Then there's how to obtain a nice view about the distributed system, which again we can achieve from the master without manually registering the standbys. After all, the information you want needs to be there. > As a particular manifestation of this, and as > previously argued and +1'd upthread, the ability to change the set of > standbys to which the master is replicating synchronously without > changing the configuration on the master or any of the existing slaves > seems seems dangerous. Well, you still need to open the HBA for the new standby to be able to connect, and to somehow take a base backup, right? We're not exactly transparent there, yet, are we? > Another reason why I think we should have standby registration is to > allow eventually allow the "streaming WAL backwards" configuration > which has previously been discussed. IOW, you could stream the WAL to > the slave in advance of fsync-ing it on the master. After a power > failure, the machines in the cluster can talk to each other and figure > out which one has the furthest-advanced WAL pointer and stream from > that machine to all the others. This is an appealing configuration > for people using sync rep because it would allow the fsyncs to be done > in parallel rather than sequentially as is currently necessary - but > if you're using it, you're certainly not going to want the master to > enter normal running without waiting to hear from the slave. I love the idea. Now it seems to me that all you need here is the master sending one more piece of information with each WAL "segment", the currently fsync'ed position, which pre-9.1 is implied as being the current LSN from the stream, right? Here I'm not sure I follow you in the details, but it seems to me registering the standbys is just another way of achieving the same.
To be honest, I don't understand at all how it helps implement your idea. Regards, -- Dimitri Fontaine PostgreSQL DBA, Architecte
On Mon, Sep 20, 2010 at 4:10 PM, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> So the "wait forever" case is, in my opinion, >> sufficient to demonstrate that we need it, but it's not even my >> primary reason for wanting to have it. > > You're talking about standby registration on the master. You can solve > this case without it, because when a slave is not connected it's not > giving any feedback (vote, weight, ack) to the master. All you have to > do is have the quorum setup in a way that disconnecting your slave means > you can't reach the quorum any more. Have it SIGHUP and you can even > choose to fix the setup, rather than fix the standby. I suppose that could work. >> The most important reason why I think we should have standby >> registration is for simplicity of configuration. Yes, it adds another >> configuration file, but that configuration file contains ALL of the >> information about which standbys are synchronous. Without standby >> registration, this information will inevitably be split between the >> master config and the various slave configs and you'll have to look at >> all the configurations to be certain you understand how it's going to >> end up working. > > So, here, we have two quite different things to be concerned > about. First is the configuration, and I say that managing a distributed > setup will be easier for the DBA. Yeah, I disagree with that, but I suppose it's a question of opinion. > Then there's how to obtain a nice view about the distributed system, > which again we can achieve from the master without manually registering > the standbys. After all, the information you want needs to be there. I think that without standby registration it will be tricky to display information like "the last time that standby foo was connected". Yeah, you could set a standby name on the standby server and just have the master remember details for every standby name it's ever seen, but then how do you prune the list? Heikki mentioned another application for having a list of the current standbys only (rather than "every standby that has ever existed") upthread: you can compute the exact amount of WAL you need to keep around. >> As a particular manifestation of this, and as >> previously argued and +1'd upthread, the ability to change the set of >> standbys to which the master is replicating synchronously without >> changing the configuration on the master or any of the existing slaves >> seems seems dangerous. > > Well, you still need to open the HBA for the new standby to be able to > connect, and to somehow take a base backup, right? We're not exactly > transparent there, yet, are we? Sure, but you might have that set relatively open on a trusted network. >> Another reason why I think we should have standby registration is to >> allow eventually allow the "streaming WAL backwards" configuration >> which has previously been discussed. IOW, you could stream the WAL to >> the slave in advance of fsync-ing it on the master. After a power >> failure, the machines in the cluster can talk to each other and figure >> out which one has the furthest-advanced WAL pointer and stream from >> that machine to all the others. This is an appealing configuration >> for people using sync rep because it would allow the fsyncs to be done >> in parallel rather than sequentially as is currently necessary - but >> if you're using it, you're certainly not going to want the master to >> enter normal running without waiting to hear from the slave. 
> > I love the idea. > > Now it seems to me that all you need here is the master sending one more > information with each WAL "segment", the currently fsync'ed position, > which pre-9.1 is implied as being the current LSN from the stream, > right? I don't see how that would help you. > Here I'm not sure to follow you in details, but it seems to me > registering the standbys is just another way of achieving the same. To > be honest, I don't understand a bit how it helps implement your idea. Well, if you need to talk to "all the other standbys" and see who has the furthest-advanced xlog pointer, it seems like you have to have a list somewhere of who they all are. Maybe there's some way to get this to work without standby registration, but I don't really understand the resistance to the idea, and I fear it's going to do nothing good for our reputation for ease of use (or lack thereof). The idea of making this all work without standby registration strikes me as akin to the notion of having someone decide whether they're running a three-legged race by checking whether their leg is currently tied to someone else's leg. You can probably make that work by patching around the various failure cases, but why isn't it simpler to just tell the poor guy "Hi, Joe. You're running a three-legged race with Jane today. Hans and Juanita will be following you across the field, too, but don't worry about whether they're keeping up."? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 20 September 2010 22:14, Robert Haas <robertmhaas@gmail.com> wrote: > Well, if you need to talk to "all the other standbys" and see who has > the furtherst-advanced xlog pointer, it seems like you have to have a > list somewhere of who they all are. When they connect to the master to get the stream, don't they, in effect, already talk to the primary with the XLogRecPtr being relayed? Can the connection IP, port, XLogRecPtr and request time of the standby be stored from this communication to track the states of each standby? They would in effect be registering upon WAL stream request... and no doubt this is a horrifically naive view of how it works. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On Mon, Sep 20, 2010 at 5:42 PM, Thom Brown <thom@linux.com> wrote: > On 20 September 2010 22:14, Robert Haas <robertmhaas@gmail.com> wrote: >> Well, if you need to talk to "all the other standbys" and see who has >> the furtherst-advanced xlog pointer, it seems like you have to have a >> list somewhere of who they all are. > > When they connect to the master to get the stream, don't they in > effect, already talk to the primary with the XLogRecPtr being relayed? > Can the connection IP, port, XLogRecPtr and request time of the > standby be stored from this communication to track the states of each > standby? They would in effect be registering upon WAL stream > request... and no doubt this is a horrifically naive view of how it > works. Sure, but the point is that we can want DISCONNECTED slaves to affect master behavior in a variety of ways (master retains WAL for when they reconnect, master waits for them to connect before acking commits, master shuts down if they're not there, master tries to stream WAL backwards from them before entering normal running). I just work here, but it seems to me that such things will be easier if the master has an explicit notion of what's out there. Can we make it all work without that? Possibly, but I think it will be harder to understand. With standby registration, you can DECLARE the behavior you want. You can tell the master "replicate synchronously to Bob". And that's it. Without standby registration, what's being proposed is basically that you can tell the master "replicate synchronously to one server" and you can tell Bob "you are a server to which the master can replicate synchronously" and you can tell the other servers "you are not a server to which Bob can replicate synchronously". That works, but to me it seems less straightforward. And that's actually a relatively simple example. Suppose you want to tell the master "keep enough WAL for Bob to catch up when he reconnects, but if he gets more than 1GB behind, forget about him". I'm sure someone can devise a way of making that work without standby registration, too, but I'm not too sure off the top of my head what it will be. With standby registration, you can just write something like this in standbys.conf (syntax invented): [bob] wal_keep_segments=64 I feel like that's really nice and simple. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Mon, 2010-09-20 at 22:42 +0100, Thom Brown wrote: > On 20 September 2010 22:14, Robert Haas <robertmhaas@gmail.com> wrote: > > Well, if you need to talk to "all the other standbys" and see who has > > the furtherst-advanced xlog pointer, it seems like you have to have a > > list somewhere of who they all are. > > When they connect to the master to get the stream, don't they in > effect, already talk to the primary with the XLogRecPtr being relayed? > Can the connection IP, port, XLogRecPtr and request time of the > standby be stored from this communication to track the states of each > standby? They would in effect be registering upon WAL stream > request... and no doubt this is a horrifically naive view of how it > works. It's not viable to record information at the chunk level in that way. But the overall idea is fine. We can track who was connected and how to access their LSNs. They don't need to be registered ahead of time on the master to do that. They can register and deregister each time they connect. This discussion is reminiscent of the discussion we had when Fujii first suggested that the standby should connect to the master. At first I thought "don't be stupid, the master needs to connect to the standby!". It stood everything I had thought about on its head and that hurt, but there was no logical reason to oppose. We could have used standby registration on the master to handle that, but we didn't. I'm happy that we have a more flexible system as a result. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Sep 18, 2010 at 4:36 AM, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: >>> What synchronization level does each combination of sync_replication >>> and sync_replication_service lead to? >> >> There are only 4 possible outcomes. There is no combination, so we don't >> need a table like that above. >> >> The "service" specifies the highest request type available from that >> specific standby. If someone requests a higher service than is currently >> offered by this standby, they will either >> a) get that service from another standby that does offer that level >> b) automatically downgrade the sync rep mode to the highest available. > > I like the a) part, I can't say the same about the b) part. There's no > reason to accept to COMMIT a transaction when the requested durability > is known not to have been reached, unless the user said so. Yep, I can imagine that some people want to ensure that *all* the transactions are synchronously replicated to the synchronous standby, without regard to sync_replication. So I'm not sure if automatic downgrade/upgrade of the mode makes sense. Should we introduce a new parameter specifying whether to allow automatic degrade/upgrade or not? It seems complicated though. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Robert Haas <robertmhaas@gmail.com> writes: >> So, here, we have two quite different things to be concerned >> about. First is the configuration, and I say that managing a distributed >> setup will be easier for the DBA. > > Yeah, I disagree with that, but I suppose it's a question of opinion. I'd be willing to share your view if it were only about the initial setup. This one is hard enough to sketch out on paper that you prefer an easy way to implement it afterwards, and in some cases a central setup would be just that. The problem is that I'm concerned with upgrading the setup once the system is live. Not at the best time for that in the project, either, but when you finally get the budget to expand the number of servers. From experience with skytools, having no manual registering works best. But… > I think that without standby registration it will be tricky to display > information like "the last time that standby foo was connected". > Yeah, you could set a standby name on the standby server and just have > the master remember details for every standby name it's ever seen, but > then how do you prune the list? … I now realize there are 2 parts under the registration bit. What I don't see helping is manual registration. For some of the use cases you're talking about, maintaining a list of known servers sounds important, and that's also what londiste is doing. Pruning the list would be done with some admin function. You need one to see the current state already, add some other one to unregister a known standby. In londiste, that's how it works, and events are kept in the queues for all known subscribers. For the ones that won't ever connect again, that's of course a problem, so you SELECT pgq.unregister_consumer(…);. > Heikki mentioned another application for having a list of the current > standbys only (rather than "every standby that has ever existed") > upthread: you can compute the exact amount of WAL you need to keep > around. Well, either way, the system can not decide on its own whether a currently not available standby is going to join the party again later on. >> Now it seems to me that all you need here is the master sending one more >> information with each WAL "segment", the currently fsync'ed position, >> which pre-9.1 is implied as being the current LSN from the stream, >> right? > > I don't see how that would help you. I think you want to refrain from applying any WAL segment you receive at the standby and instead only advance as far as the master is known to have reached. And you want this information to be safe against slave restart, too: don't replay any WAL you have in pg_xlog or in the archive. The other part of your proposal is another story (having slaves talk to each other at master crash). > Well, if you need to talk to "all the other standbys" and see who has > the furtherst-advanced xlog pointer, it seems like you have to have a > list somewhere of who they all are. Ah sorry, I was thinking only of the other part of the proposal (sending WAL segments that have not been fsync'ed yet on the master). So, yes. But I thought you were saying that replicating a (shared?) catalog of standbys is technically hard (or impossible), so how would you go about it? As it's all about making things simpler for the users, you're not saying that they should keep the main setup in sync manually on all the standby servers, right?
> Maybe there's some way to get > this to work without standby registration, but I don't really > understand the resistance to the idea In fact, I'm now realising that what I don't like is having to manually do the registration work: as I already have to set up the slaves, it only appears like a useless burden on me, giving information the system already has. Automatic registration I'm fine with, I now realize. Regards, -- Dimitri Fontaine PostgreSQL DBA, Architecte
On Sun, Sep 19, 2010 at 7:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Sep 18, 2010 at 5:42 PM, Josh Berkus <josh@agliodbs.com> wrote: >> There are considerable benefits to having a standby registry with a >> table-like interface. Particularly, one where we could change >> replication via UPDATE (or ALTER STANDBY) statements. > > I think that using a system catalog for this is going to be a > non-starter, but we could use a flat file that is designed to be > machine-editable (and thus avoid repeating the mistake we've made with > postgresql.conf). Yep, the standby registration information should be accessible and changeable while the server is not running. So using only a system catalog is not an answer. My patch has implemented standbys.conf which was proposed before. This format is almost the same as pg_hba.conf's. Is this machine-editable, do you think? If not, should we change the format to something like XML? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
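To make the pg_hba.conf comparison concrete, a one-standby-per-line standbys.conf would presumably look something like the sketch below. The column names and values are invented for illustration only and are not taken from the actual patch (the sync levels recv/fsync/async are the ones discussed elsewhere in this thread):

-------
# STANDBY-NAME    SYNC-LEVEL    OPTIONS
standby1          recv          timeout=30s
standby2          fsync
standby3          async
-------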
On 21 September 2010 09:29, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sun, Sep 19, 2010 at 7:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Sep 18, 2010 at 5:42 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> There are considerable benefits to having a standby registry with a >>> table-like interface. Particularly, one where we could change >>> replication via UPDATE (or ALTER STANDBY) statements. >> >> I think that using a system catalog for this is going to be a >> non-starter, but we could use a flat file that is designed to be >> machine-editable (and thus avoid repeating the mistake we've made with >> postgresql.conf). > > Yep, the standby registration information should be accessible and > changable while the server is not running. So using only system > catalog is not an answer. > > My patch has implemented standbys.conf which was proposed before. > This format is the almost same as the pg_hba.conf. Is this > machine-editable, you think? If not, we should the format to > something like xml? I really don't think an XML config would improve anything. In fact it would just introduce more ways to break the config by the mere fact it has to be well-formed. I'd be in favour of one similar to pg_hba.conf, because then, at least, we'd still only have 2 formats of configuration. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On Tue, Sep 21, 2010 at 9:34 AM, Thom Brown <thom@linux.com> wrote: > I really don't think an XML config would improve anything. In fact it > would just introduce more ways to break the config by the mere fact it > has to be well-formed. I'd be in favour of one similar to > pg_hba.conf, because then, at least, we'd still only have 2 formats of > configuration. Want to spend a few days hacking on a config editor for pgAdmin, and then re-evaluate that comment? :-) -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On 21 September 2010 09:37, Dave Page <dpage@pgadmin.org> wrote: > On Tue, Sep 21, 2010 at 9:34 AM, Thom Brown <thom@linux.com> wrote: >> I really don't think an XML config would improve anything. In fact it >> would just introduce more ways to break the config by the mere fact it >> has to be well-formed. I'd be in favour of one similar to >> pg_hba.conf, because then, at least, we'd still only have 2 formats of >> configuration. > > Want to spend a few days hacking on a config editor for pgAdmin, and > then re-evaluate that comment? It would be quicker to add in support for a config format we don't use yet than to duplicate support for a new config in the same format as an existing one? Plus it's a compromise between user-screw-up-ability and machine-readability. My fear would be standby.conf would be edited by users who don't really know XML and then we'd have 3 different styles of config to tell the user to edit. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On Mon, Sep 20, 2010 at 3:27 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > However, the "wait forever" behavior becomes useful if you have a monitoring > application outside the DB that decides when enough is enough and tells the > DB that the slave can be considered dead. So "wait forever" actually means > "wait until I tell you that you can give up". The monitoring application can > STONITH to ensure that the slave stays down, before letting the master > proceed with the commit. This is also useful for preventing a failover from causing some data loss by promoting the lagged standby to the master. To avoid any data loss, we must STONITH the standby before any transactions resume on the master when the replication connection is terminated or the standby crashes. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 21/09/10 11:52, Thom Brown wrote: > My fear would be standby.conf would be edited by users who don't > really know XML and then we'd have 3 different styles of config to > tell the user to edit. I'm not a big fan of XML either. That said, the format could use some hierarchy. If we add many more per-server options, one server per line will quickly become unreadable. Perhaps something like the ini-file syntax Robert Haas just made up elsewhere in this thread: ------- globaloption1 = value [servername1] synchronization_level = async option1 = value [servername2] synchronization_level = replay option2 = value1 ------- I'm not sure I like the ini-file style much, but the two-level structure it provides seems like a perfect match. Then again, maybe we should go with something like json or yaml that would allow deeper hierarchies for the sake of future expandability. Oh, and there's Dimitri's idea of "service levels" for per-transaction control (http://archives.postgresql.org/message-id/m2sk1868hb.fsf@hi-media.com): > sync_rep_services = {critical: recv=2, fsync=2, replay=1; > important: fsync=3; > reporting: recv=2, apply=1} We'll need to accommodate something like that too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 21/09/10 11:52, Thom Brown wrote: >> My fear would be standby.conf would be edited by users who don't >> really know XML and then we'd have 3 different styles of config to >> tell the user to edit. > I'm not a big fan of XML either. > ... > Then again, maybe we should go with something like json or yaml The fundamental problem with all those "machine editable" formats is that they aren't "people editable". If you have to have a tool (other than a text editor) to change a config file, you're going to be very unhappy when things are broken at 3AM and you're trying to fix it while ssh'd in from your phone. I think the "ini file" format suggestion is probably a good one; it seems to fit this problem, and it's something that people are used to. We could probably shoehorn the info into a pg_hba-like format, but I'm concerned about whether we'd be pushing that format beyond what it can reasonably handle. regards, tom lane
On Tue, Sep 21, 2010 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> On 21/09/10 11:52, Thom Brown wrote: >>> My fear would be standby.conf would be edited by users who don't >>> really know XML and then we'd have 3 different styles of config to >>> tell the user to edit. > >> I'm not a big fan of XML either. >> ... >> Then again, maybe we should go with something like json or yaml > > The fundamental problem with all those "machine editable" formats is > that they aren't "people editable". If you have to have a tool (other > than a text editor) to change a config file, you're going to be very > unhappy when things are broken at 3AM and you're trying to fix it > while ssh'd in from your phone. Agreed. Although, if things are broken at 3AM and I'm trying to fix it while ssh'd in from my phone, I reserve the right to be VERY unhappy no matter what format the file is in. :-) > I think the "ini file" format suggestion is probably a good one; it > seems to fit this problem, and it's something that people are used to. > We could probably shoehorn the info into a pg_hba-like format, but > I'm concerned about whether we'd be pushing that format beyond what > it can reasonably handle. It's not clear how many attributes we'll want to associate with a server. Simon seems to think we can keep it to zero; I think it's positive but I can't say for sure how many there will eventually be. It may also be that a lot of the values will be optional things that are frequently left unspecified. Both of those make me think that a columnar format is probably not best. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, 2010-09-21 at 16:58 +0900, Fujii Masao wrote: > On Sat, Sep 18, 2010 at 4:36 AM, Dimitri Fontaine > <dfontaine@hi-media.com> wrote: > > Simon Riggs <simon@2ndQuadrant.com> writes: > >> On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: > >>> What synchronization level does each combination of sync_replication > >>> and sync_replication_service lead to? > >> > >> There are only 4 possible outcomes. There is no combination, so we don't > >> need a table like that above. > >> > >> The "service" specifies the highest request type available from that > >> specific standby. If someone requests a higher service than is currently > >> offered by this standby, they will either > >> a) get that service from another standby that does offer that level > >> b) automatically downgrade the sync rep mode to the highest available. > > > > I like the a) part, I can't say the same about the b) part. There's no > > reason to accept to COMMIT a transaction when the requested durability > > is known not to have been reached, unless the user said so. Hmm, no reason? The reason is that the alternative is that the session would hang until a standby arrived that offered that level of service. Why would you want that behaviour? Would you really request that option? > Yep, I can imagine that some people want to ensure that *all* the > transactions are synchronously replicated to the synchronous standby, > without regard to sync_replication. So I'm not sure if automatic > downgrade/upgrade of the mode makes sense. We should introduce new > parameter specifying whether to allow automatic degrade/upgrade or not? > It seems complicated though. I agree, but I'm not against additional parameters if people say they really want them *after* the consequences of those choices have been highlighted. IMHO we should focus on the parameters that deliver key use cases. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Robert Haas wrote: > On Tue, Sep 21, 2010 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >> On 21/09/10 11:52, Thom Brown wrote: > >>> My fear would be standby.conf would be edited by users who don't > >>> really know XML and then we'd have 3 different styles of config to > >>> tell the user to edit. > > > >> I'm not a big fan of XML either. > >> ... > >> Then again, maybe we should go with something like json or yaml > > > > The fundamental problem with all those "machine editable" formats is > > that they aren't "people editable". If you have to have a tool (other > > than a text editor) to change a config file, you're going to be very > > unhappy when things are broken at 3AM and you're trying to fix it > > while ssh'd in from your phone. > > Agreed. Although, if things are broken at 3AM and I'm trying to fix > it while ssh'd in from my phone, I reserve the right to be VERY > unhappy no matter what format the file is in. :-) > > > I think the "ini file" format suggestion is probably a good one; it > > seems to fit this problem, and it's something that people are used to. > > We could probably shoehorn the info into a pg_hba-like format, but > > I'm concerned about whether we'd be pushing that format beyond what > > it can reasonably handle. > > It's not clear how many attributes we'll want to associate with a > server. Simon seems to think we can keep it to zero; I think it's > positive but I can't say for sure how many there will eventually be. > It may also be that a lot of the values will be optional things that > are frequently left unspecified. Both of those make me think that a > columnar format is probably not best. Crazy idea, but could we use a format like postgresql.conf by extending postgresql.conf syntax, e.g.: server1.failover = false server1.keep_connect = true -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
> That said, the timeout option also feels a bit wishy-washy to me. With a > timeout, acknowledgment of a commit means "your transaction is safely > committed in the master and slave. Or not, if there was some glitch with > the slave". That doesn't seem like a very useful guarantee; if you're > happy with that why not just use async replication? Ah, I wasn't clear. My thought was that a standby which exceeds the timeout would be marked as "nonresponsive" and no longer included in the list of standbys which needed to be synchronized. That is, the timeout would be a timeout which says "this standby is down". > So the only case where standby registration is required is where you > deliberately choose to *not* have N+1 redundancy and then yet still > require all N standbys to acknowledge. That is a suicidal config and > nobody would sanely choose that. It's not a large or useful use case for > standby reg. (But it does raise the question again of whether we need > quorum commit). Thinking of this as a sysadmin, what I want is to have *one place* I can go and troubleshoot my standby setup. If I have 12 synch standbys and they're creating too much load on the master, and I want to change half of them to async, I don't want to have to ssh into 6 different machines to do so. If one standby needs to be taken out of the network because it's too slow, I want to be able to log in to the master and instantly identify which standby is lagging and remove it there. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
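As a concrete illustration of the "one place" point: with a registration file on the master, demoting some standbys to async would be a single edit there, roughly along the lines of the hypothetical snippet below (reusing the ini-style syntax and synchronization_level option sketched earlier in the thread; the section names are invented), instead of touching six separate machines.

-------
[standby7]
synchronization_level = async   # was fsync; demoted to reduce load on the master
[standby8]
synchronization_level = async
-------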
> Crazy idea, but could we use format like postgresql.conf by extending > postgresql.conf syntax, e.g.: > > server1.failover = false > server1.keep_connect = true > Why is this in the config file at all. It should be: synchronous_replication = TRUE/FALSE then ALTER CLUSTER ENABLE REPLICATION FOR FOO; ALTER CLUSTER SET keep_connect ON FOO TO TRUE; Or some such thing. Sincerely, Joshua D. Drake > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + It's impossible for everything to be true. + -- PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On 22/09/10 03:25, Joshua D. Drake wrote: > Why is this in the config file at all. It should be: > > synchronous_replication = TRUE/FALSE Umm, what does this do? > then > > ALTER CLUSTER ENABLE REPLICATION FOR FOO; > ALTER CLUSTER SET keep_connect ON FOO TO TRUE; > > Or some such thing. I like a configuration file more because you can easily add comments, comment out lines, etc. It also makes it easier to have a different configuration in master and standby. We don't support cascading slaves, yet, but you might still want a different configuration in master and slave, waiting for the moment that the slave is promoted to a new master. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 21/09/10 18:12, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> On 21/09/10 11:52, Thom Brown wrote: >>> My fear would be standby.conf would be edited by users who don't >>> really know XML and then we'd have 3 different styles of config to >>> tell the user to edit. > >> I'm not a big fan of XML either. >> ... >> Then again, maybe we should go with something like json or yaml > > The fundamental problem with all those "machine editable" formats is > that they aren't "people editable". If you have to have a tool (other > than a text editor) to change a config file, you're going to be very > unhappy when things are broken at 3AM and you're trying to fix it > while ssh'd in from your phone. I'm not very familiar with any of those formats, but I agree it needs to be easy to edit by hand first and foremost. > I think the "ini file" format suggestion is probably a good one; it > seems to fit this problem, and it's something that people are used to. > We could probably shoehorn the info into a pg_hba-like format, but > I'm concerned about whether we'd be pushing that format beyond what > it can reasonably handle. The ini file format seems to be enough for the features proposed this far, but I'm a bit concerned that even that might not be flexible enough for future features. I guess we'll cross the bridge when we get there and go with an ini file for now. It should be possible to extend it in various ways, and in the worst case that we have to change to a completely different format, we can provide a how to guide on converting existing config files to the new format. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi, On 09/21/2010 08:05 PM, Simon Riggs wrote: > Hmm, no reason? The reason is that the alternative is that the session > would hang until a standby arrived that offered that level of service. > Why would you want that behaviour? Would you really request that option? I think I now agree with Simon on that point. It's only an issue in multi-master replication, where continued operation would lead to a split-brain situation. With master-slave, you only need to make sure your master stays the master even if the standby crash(es) are followed by a master crash. If your cluster-ware is too clever and tries a fail-over on a slave that's quicker to come up, you get the same split-brain situation. Put another way: if you let your master continue, don't ever try a fail-over after a full-cluster crash. Regards Markus Wanner
On 09/22/2010 04:18 AM, Heikki Linnakangas wrote: > On 21/09/10 18:12, Tom Lane wrote: >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >>> On 21/09/10 11:52, Thom Brown wrote: >>>> My fear would be standby.conf would be edited by users who don't >>>> really know XML and then we'd have 3 different styles of config to >>>> tell the user to edit. >>> I'm not a big fan of XML either. >>> ... >>> Then again, maybe we should go with something like json or yaml >> The fundamental problem with all those "machine editable" formats is >> that they aren't "people editable". If you have to have a tool (other >> than a text editor) to change a config file, you're going to be very >> unhappy when things are broken at 3AM and you're trying to fix it >> while ssh'd in from your phone. > I'm not very familiar with any of those formats, but I agree it needs to be easy to edit by hand first and foremost. >> I think the "ini file" format suggestion is probably a good one; it >> seems to fit this problem, and it's something that people are used to. >> We could probably shoehorn the info into a pg_hba-like format, but >> I'm concerned about whether we'd be pushing that format beyond what >> it can reasonably handle. > The ini file format seems to be enough for the features proposed this far, but I'm a bit concerned that even that might not be flexible enough for future features. I guess we'll cross the bridge when we get there and go with an ini file for now. It should be possible to extend it in various ways, and in the worst case that we have to change to a completely different format, we can provide a how to guide on converting existing config files to the new format. The ini file format is not flexible enough, IMNSHO. If we're going to adopt a new config file format it should have these characteristics, among others: well known (let's not invent a new one); supports hierarchical structure; reasonably readable. I realize that the last is very subjective. Personally, I'm very comfortable with XML, but then I do a *lot* of work with it, and have for many years. I know I'm in a minority on that, and some people just go bananas when they see it. Since we're just about to add a JSON parser to the backend, by the look of it, that looks like a reasonable bet. Maybe it uses a few too many quotes, but that's not really so hard to get your head around, even if it offends you a bit aesthetically. And it is certainly fairly widely known. cheers andrew
On Wed, Sep 22, 2010 at 9:47 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > The ini file format is not flexible enough, IMNSHO. If we're going to adopt > a new config file format it should have these characteristics, among others: > > well known (let's not invent a new one) > supports hierarchical structure > reasonably readable The ini format meets all of those requirements - and it's certainly far more readable/editable than XML and friends. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/22/2010 04:54 AM, Dave Page wrote: > On Wed, Sep 22, 2010 at 9:47 AM, Andrew Dunstan <andrew@dunslane.net> wrote: >> The ini file format is not flexible enough, IMNSHO. If we're going to adopt >> a new config file format it should have these characteristics, among others: >> >> well known (let's not invent a new one) >> supports hierarchical structure >> reasonably readable > The ini format meets all of those requirements - and it's certainly > far more readable/editable than XML and friends. > No, it's really not hierarchical. It only goes one level deep. cheers andrew
On Wed, Sep 22, 2010 at 12:07 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 09/22/2010 04:54 AM, Dave Page wrote: >> >> On Wed, Sep 22, 2010 at 9:47 AM, Andrew Dunstan<andrew@dunslane.net> >> wrote: >>> >>> The ini file format is not flexible enough, IMNSHO. If we're going to >>> adopt >>> a new config file format it should have these characteristics, among >>> others: >>> >>> well known (let's not invent a new one) >>> supports hierarchical structure >>> reasonably readable >> >> The ini format meets all of those requirements - and it's certainly >> far more readable/editable than XML and friends. >> > > No, it's really not hierarchical. It only has goes one level deep. I guess pgAdmin/wxWidgets are broken then :-) [Servers] Count=5 [Servers/1] Server=localhost Description=PostgreSQL 8.3 ServiceID= DiscoveryID=/PostgreSQL/8.3 Port=5432 StorePwd=true Restore=false Database=postgres Username=postgres LastDatabase=postgres LastSchema=public DbRestriction= Colour=#FFFFFF SSL=0 Group=PPAS Rolename= [Servers/1/Databases] [Servers/1/Databases/postgres] SchemaRestriction= [Servers/1/Databases/pphq] SchemaRestriction= [Servers/1/Databases/template_postgis] SchemaRestriction= [Servers/2] ... ... -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On ons, 2010-09-22 at 12:20 +0100, Dave Page wrote: > > No, it's really not hierarchical. It only has goes one level deep. > > I guess pgAdmin/wxWidgets are broken then :-) > > [Servers] > Count=5 > [Servers/1] > Server=localhost Well, by that logic, even what we have now for postgresql.conf is hierarchical. I think the criterion was rather meant to be - can represent hierarchies without repeating intermediate node names (Note: no opinion on which format is better for the task at hand)
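A tiny illustration of the criterion Peter states, using the names from Dave's pgAdmin listing; the syntax on both sides is invented here purely for comparison. The flat form has to repeat the intermediate node names on every line, while a nested form names them once:

-------
# flat, repeating intermediate names:
Servers/1/Databases/postgres/SchemaRestriction =
Servers/1/Databases/pphq/SchemaRestriction =

# nested, naming them once (hypothetical syntax):
Servers/1 { Databases { postgres { SchemaRestriction = "" } pphq { SchemaRestriction = "" } } }
-------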
On 09/22/2010 07:20 AM, Dave Page wrote: > On Wed, Sep 22, 2010 at 12:07 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >> >> On 09/22/2010 04:54 AM, Dave Page wrote: >>> On Wed, Sep 22, 2010 at 9:47 AM, Andrew Dunstan<andrew@dunslane.net> >>> wrote: >>>> The ini file format is not flexible enough, IMNSHO. If we're going to >>>> adopt >>>> a new config file format it should have these characteristics, among >>>> others: >>>> >>>> well known (let's not invent a new one) >>>> supports hierarchical structure >>>> reasonably readable >>> The ini format meets all of those requirements - and it's certainly >>> far more readable/editable than XML and friends. >>> >> No, it's really not hierarchical. It only has goes one level deep. > I guess pgAdmin/wxWidgets are broken then :-) > > [Servers] > Count=5 > [Servers/1] > Server=localhost > Description=PostgreSQL 8.3 > ServiceID= > DiscoveryID=/PostgreSQL/8.3 > Port=5432 > StorePwd=true > Restore=false > Database=postgres > Username=postgres > LastDatabase=postgres > LastSchema=public > DbRestriction= > Colour=#FFFFFF > SSL=0 > Group=PPAS > Rolename= > [Servers/1/Databases] > [Servers/1/Databases/postgres] > SchemaRestriction= > [Servers/1/Databases/pphq] > SchemaRestriction= > [Servers/1/Databases/template_postgis] > SchemaRestriction= > [Servers/2] > ... > ... Well, that's not what I'd call a hierarchy, in any sane sense. I've often had to dig all over the place in ini files to find related bits of information in disparate parts of the file. Compared to a meaningful tree structure this is utterly woeful. In a sensible hierarchical format, all the information relating to, say, Servers/1 above, would be under a stanza with that heading, instead of having separate and unnested stanzas like Servers/1/Databases/template_postgis. If you could nest stanzas in ini file format it would probably do, but you can't, leading to the above major ugliness. cheers andrew
On Wed, Sep 22, 2010 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On ons, 2010-09-22 at 12:20 +0100, Dave Page wrote: >> > No, it's really not hierarchical. It only has goes one level deep. >> >> I guess pgAdmin/wxWidgets are broken then :-) >> >> [Servers] >> Count=5 >> [Servers/1] >> Server=localhost > > Well, by that logic, even what we have now for postgresql.conf is > hierarchical. Well, yes - if you consider add-in GUCs which use prefixing like foo.setting=... > I think the criterion was rather meant to be > > - can represent hierarchies without repeating intermediate node names If this were data, I could understand that as it could lead to tremendous bloat, but as a config file, I'd rather have the readability of the ini format, despite the repeated node names, than have to hack XML files by hand. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/22/2010 07:57 AM, Dave Page wrote: > On Wed, Sep 22, 2010 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote: >> On ons, 2010-09-22 at 12:20 +0100, Dave Page wrote: >>>> No, it's really not hierarchical. It only has goes one level deep. >>> I guess pgAdmin/wxWidgets are broken then :-) >>> [Servers] >>> Count=5 >>> [Servers/1] >>> Server=localhost >> Well, by that logic, even what we have now for postgresql.conf is hierarchical. > Well, yes - if you consider add-in GUCs which use prefixing like foo.setting=... >> I think the criterion was rather meant to be >> - can represent hierarchies without repeating intermediate node names > If this were data, I could understand that as it could lead to tremendous bloat, but as a config file, I'd rather have the readability of the ini format, despite the repeated node names, than have to hack XML files by hand. XML is not the only alternative - please don't use it as a straw man. For example, here is a fragment from the Bacula docs using their hierarchical format: FileSet { Name = Test Include { File = /home/xxx/test Options { regex = ".*\.c$" } } } Or here is a piece from the buildfarm client config (which is in fact perl, but could also be JSON or similar fairly easily): mail_events => { all => [], fail => [], change => ['foo@bar.com', 'baz@blurfl.org' ], green => [], }, build_env => { CCACHE_DIR => "/home/andrew/pgfarmbuild/ccache/$branch", }, cheers andrew
On Wed, Sep 22, 2010 at 1:25 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > XML is not the only alternative - please don't use it as a straw man. For > example, here is a fragment from the Bacula docs using their hierarchical > format: > > FileSet { > Name = Test > Include { > File = /home/xxx/test > Options { > regex = ".*\.c$" > } > } > } > > Or here is a piece from the buildfarm client config (which is in fact perl, > but could also be JSON or similar fairly easily): > > mail_events => > { > all => [], > fail => [], > change => ['foo@bar.com', 'baz@blurfl.org' ], > green => [], > }, > build_env => > { > CCACHE_DIR => "/home/andrew/pgfarmbuild/ccache/$branch", > }, Both of which I've also used in the past, and also find uncomfortable and awkward for configuration files. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/22/2010 08:32 AM, Dave Page wrote: > On Wed, Sep 22, 2010 at 1:25 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >> XML is not the only alternative - please don't use it as a straw man. For >> example, here is a fragment from the Bacula docs using their hierarchical >> format: >> >> FileSet { >> Name = Test >> Include { >> File = /home/xxx/test >> Options { >> regex = ".*\.c$" >> } >> } >> } >> >> Or here is a piece from the buildfarm client config (which is in fact perl, >> but could also be JSON or similar fairly easily): >> >> mail_events => >> { >> all => [], >> fail => [], >> change => ['foo@bar.com', 'baz@blurfl.org' ], >> green => [], >> }, >> build_env => >> { >> CCACHE_DIR => "/home/andrew/pgfarmbuild/ccache/$branch", >> }, > Both of which I've also used in the past, and also find uncomfortable > and awkward for configuration files. > > I can't imagine trying to configure Bacula using ini file format - the mind just boggles. Frankly, I'd rather stick with our current config format than change to something as inadequate as ini file format. cheers andrew
On Wed, Sep 22, 2010 at 9:01 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > I can't imagine trying to configure Bacula using ini file format - the mind > just boggles. Frankly, I'd rather stick with our current config format than > change to something as inadequate as ini file format. Perhaps we need to define a little better what information we think we might eventually need to represent in the config file. With one exception, nobody has suggested anything that would actually require hierarchical structure. The exception is defining the policy for deciding when a commit has been sufficiently acknowledged by an adequate quorum of standbys, and it seems to me that doing that in its full generality is going to require not so much a hierarchical structure as a small programming language. The efforts so far have centered around reducing the use cases that $AUTHOR cares about to a set of GUCs which would satisfy that person's needs, but not necessarily everyone else's needs. I think efforts to encode arbitrary algorithms using configuration settings are doomed to failure, so I'm unimpressed by the argument that we should design the config file to support our attempts to do so. For everything else, no one has suggested that we need anything more complex than, essentially, a group of GUCs per server. So we could do: [server] guc=value or server.guc=value ...or something else. Designing this to support: server.hypothesis.experimental.unproven.imaginary.what-in-the-world-could-this-possibly-be = 42 ...seems pretty speculative at this point, unless someone can imagine what we'd want it for. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, 2010-09-21 at 17:04 -0700, Josh Berkus wrote: > > That said, the timeout option also feels a bit wishy-washy to me. With a > > timeout, acknowledgment of a commit means "your transaction is safely > > committed in the master and slave. Or not, if there was some glitch with > > the slave". That doesn't seem like a very useful guarantee; if you're > > happy with that why not just use async replication? > > Ah, I wasn't clear. My thought was that a standby which exceeds the > timeout would be marked as "nonresponsive" and no longer included in the > list of standbys which needed to be synchronized. That is, the timeout > would be a timeout which says "this standby is down". > > > So the only case where standby registration is required is where you > > deliberately choose to *not* have N+1 redundancy and then yet still > > require all N standbys to acknowledge. That is a suicidal config and > > nobody would sanely choose that. It's not a large or useful use case for > > standby reg. (But it does raise the question again of whether we need > > quorum commit). This is becoming very confusing. Some people advocating "standby registration" have claimed it allows capabilities which aren't possible any other way; all but one of those claims has so far been wrong - the remaining case is described above. If I'm the one that is wrong, please tell me where I erred. > Thinking of this as a sysadmin, what I want is to have *one place* I can > go an troubleshoot my standby setup. If I have 12 synch standbys and > they're creating too much load on the master, and I want to change half > of them to async, I don't want to have to ssh into 6 different machines > to do so. If one standby needs to be taken out of the network because > it's too slow, I want to be able to log in to the master and instantly > identify which standby is lagging and remove it there. The above case is one where I can see your point and it does sound easier in that case. But I then think: "What happens after failover?". We would then need to have 12 different standby.conf files, one on each standby that describes what the setup would look like if that standby became the master. And guess what, every time we made a change on the master, you'd need to re-edit all 12 standby.conf files to reflect the new configuration. So we're still back to having to edit in multiple places, ISTM. Please, please, somebody write down what the design proposal is *before* we make a decision on whether it is a sensible way to proceed. It would be good to see a few options written down and some objective analysis of which way is best to let people decide. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Robert Haas wrote: > [server] > guc=value > > or > > server.guc=value ^^^^^^^^^^^^^^^^ Yes, this was my idea too. It uses our existing config file format. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 22 September 2010 17:23, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> [server] >> guc=value >> >> or >> >> server.guc=value > ^^^^^^^^^^^^^^^^ > > Yes, this was my idea too. It uses our existing config file format. > So... sync_rep_services = {critical: recv=2, fsync=2, replay=1; important: fsync=3; reporting:recv=2, apply=1} becomes ... sync_rep_services.critical.recv = 2 sync_rep_services.critical.fsync = 2 sync_rep_services.critical.replay = 2 sync_rep_services.important.fsync = 3 sync_rep_services.reporting.recv = 2 sync_rep_services.reporting.apply = 1 I actually started to give this example to demonstrate how cumbersome it would look... but now that I've just typed it out, I've changed my mind. I actually like it! -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
Thom Brown wrote: > On 22 September 2010 17:23, Bruce Momjian <bruce@momjian.us> wrote: > > Robert Haas wrote: > >> [server] > >> guc=value > >> > >> or > >> > >> server.guc=value > > ^^^^^^^^^^^^^^^^ > > > > Yes, this was my idea too. It uses our existing config file format. > > > > So... > > sync_rep_services = {critical: recv=2, fsync=2, replay=1; > important: fsync=3; > reporting: recv=2, apply=1} > > becomes ... > > sync_rep_services.critical.recv = 2 > sync_rep_services.critical.fsync = 2 > sync_rep_services.critical.replay = 2 > sync_rep_services.important.fsync = 3 > sync_rep_services.reporting.recv = 2 > sync_rep_services.reporting.apply = 1 > > I actually started to give this example to demonstrate how cumbersome > it would look... but now that I've just typed it out, I've changed my > mind. I actually like it! It can be prone to mistyping, but it seems simple enough. We already throw a nice error for mistypes in the server logs. :-) I don't think we support 3rd-level specifications, but we could. Looks very Java-ish. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, 2010-09-22 at 17:43 +0100, Thom Brown wrote: > So... > > sync_rep_services = {critical: recv=2, fsync=2, replay=1; > important: fsync=3; > reporting: recv=2, apply=1} > > becomes ... > > sync_rep_services.critical.recv = 2 > sync_rep_services.critical.fsync = 2 > sync_rep_services.critical.replay = 2 > sync_rep_services.important.fsync = 3 > sync_rep_services.reporting.recv = 2 > sync_rep_services.reporting.apply = 1 > > I actually started to give this example to demonstrate how cumbersome > it would look... but now that I've just typed it out, I've changed my > mind. I actually like it! With respect, this is ugly. Very ugly. Why do we insist on cryptic parameters within a config file which should be set within the database by a superuser? I mean really? ALTER CLUSTER ENABLE [SYNC] REPLICATION ON db.foobar.com PORT 5432 ALIAS CRITICAL; ALTER CLUSTER SET REPLICATION CRITICAL RECEIVE FOR 2; ALTER CLUSTER SET REPLICATION CRITICAL FSYNC FOR 2; ALTER CLUSTER SET REPLICATION CRITICAL REPLAY FOR 2; Or some such thing. I saw Heikki's reply but really the idea that we are shoving this all into the postgresql.conf is cumbersome. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Wed, Sep 22, 2010 at 12:51 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Wed, 2010-09-22 at 17:43 +0100, Thom Brown wrote: > >> So... >> >> sync_rep_services = {critical: recv=2, fsync=2, replay=1; >> important: fsync=3; >> reporting: recv=2, apply=1} >> >> becomes ... >> >> sync_rep_services.critical.recv = 2 >> sync_rep_services.critical.fsync = 2 >> sync_rep_services.critical.replay = 2 >> sync_rep_services.important.fsync = 3 >> sync_rep_services.reporting.recv = 2 >> sync_rep_services.reporting.apply = 1 >> >> I actually started to give this example to demonstrate how cumbersome >> it would look... but now that I've just typed it out, I've changed my >> mind. I actually like it! > > With respect, this is ugly. Very ugly. Why do we insist on cryptic > parameters within a config file which should be set within the database > by a super user. > > I mean really? > > ALTER CLUSTER ENABLE [SYNC] REPLICATION ON db.foobar.com PORT 5432 ALIAS > CRITICAL; > ALTER CLUSTER SET REPLICATION CRITICAL RECEIVE FOR 2; > ALTER CLUSTER SET REPLICATION CRITICAL FSYNC FOR 2; > ALTER CLUSTER SET REPLICATION CRITICAL REPLAY FOR 2; > > Or some such thing. I saw Heiiki's reply but really the idea that we are > shoving this all into the postgresql.conf is cumbersome. I think it should be a separate config file, and I think it should be a config file that can be edited using DDL commands as you propose. But it CAN'T be a system catalog, because, among other problems, that rules out cascading slaves, which are a feature a lot of people probably want to eventually have. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 22/09/10 20:00, Robert Haas wrote: > But it CAN'T be a system catalog, because, among other problems, that > rules out cascading slaves, which are a feature a lot of people > probably want to eventually have. FWIW it could be a system catalog backed by a flat file. But I'm not in favor of that for the other reasons I stated earlier. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 22, 2010 at 8:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote: Not speaking to the necessity of standby registration, but... >> Thinking of this as a sysadmin, what I want is to have *one place* I can >> go an troubleshoot my standby setup. If I have 12 synch standbys and >> they're creating too much load on the master, and I want to change half >> of them to async, I don't want to have to ssh into 6 different machines >> to do so. If one standby needs to be taken out of the network because >> it's too slow, I want to be able to log in to the master and instantly >> identify which standby is lagging and remove it there. > > The above case is one where I can see your point and it does sound > easier in that case. But I then think: "What happens after failover?". > We would then need to have 12 different standby.conf files, one on each > standby that describes what the setup would look like if that standby > became the master. And guess what, every time we made a change on the > master, you'd need to re-edit all 12 standby.conf files to reflect the > new configuration. So we're still back to having to edit in multiple > places, ISTM. An interesting option here might be to have "replication.conf" (instead of standby.conf) which would list all servers, and a postgresql.conf setting which would set the "local name" the master would then ignore. Then all PG servers (master+slave) would be able to have identical replication.conf files, only having to know their own "name". Their own name could be GUC, from postgresql.conf, or from command line options, or default to hostname, whatever.
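A rough sketch of that idea, with every file and parameter name invented purely for illustration: all nodes could share one identical replication.conf and differ only in a single local setting that names the node itself.

    # replication.conf, identical on every node
    nodeA.host = 10.0.0.1
    nodeA.sync = on
    nodeB.host = 10.0.0.2
    nodeB.sync = off

    # postgresql.conf, the only per-node difference
    node_name = 'nodeA'

Each server would then simply skip (or treat specially) the entry that matches its own node_name.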
On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > > I mean really? > > > > ALTER CLUSTER ENABLE [SYNC] REPLICATION ON db.foobar.com PORT 5432 ALIAS > > CRITICAL; > > ALTER CLUSTER SET REPLICATION CRITICAL RECEIVE FOR 2; > > ALTER CLUSTER SET REPLICATION CRITICAL FSYNC FOR 2; > > ALTER CLUSTER SET REPLICATION CRITICAL REPLAY FOR 2; > > > > Or some such thing. I saw Heiiki's reply but really the idea that we are > > shoving this all into the postgresql.conf is cumbersome. > > I think it should be a separate config file, and I think it should be > a config file that can be edited using DDL commands as you propose. > But it CAN'T be a system catalog, because, among other problems, that > rules out cascading slaves, which are a feature a lot of people > probably want to eventually have. I guarantee you there is a way around the cascade slave problem. I believe there will be "some" postgresql.conf pollution. I don't see any other way around that but the conf should be limited to things that literally have to be expressed in a conf for specific static purposes. I was talking with Bruce on Jabber and one of his concerns with my approach is "polluting the SQL space for non-admins". I certainly appreciate that my solution puts code in more places and that it may be more of a burden for the hackers. However, we aren't building this for hackers. Most hackers don't even use the product. We are building it for our community, which are by far user space developers and dbas. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Heikki Linnakangas wrote: > On 22/09/10 20:00, Robert Haas wrote: > > But it CAN'T be a system catalog, because, among other problems, that > > rules out cascading slaves, which are a feature a lot of people > > probably want to eventually have. > > FWIW it could be a system catalog backed by a flat file. But I'm not in > favor of that for the other reasons I stated earlier. I thought we just eliminated flat file backing store for tables to improve replication behavior --- I don't see returning to that as a win. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 22/09/10 20:02, Heikki Linnakangas wrote: > On 22/09/10 20:00, Robert Haas wrote: >> But it CAN'T be a system catalog, because, among other problems, that >> rules out cascading slaves, which are a feature a lot of people >> probably want to eventually have. > > FWIW it could be a system catalog backed by a flat file. But I'm not in > favor of that for the other reasons I stated earlier. Huh, I just realized that my reply didn't make any sense. For some reason I thought you were saying that it can't be a catalog because backends need to access it without attaching to a database. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 22, 2010 at 1:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > >> > I mean really? >> > >> > ALTER CLUSTER ENABLE [SYNC] REPLICATION ON db.foobar.com PORT 5432 ALIAS >> > CRITICAL; >> > ALTER CLUSTER SET REPLICATION CRITICAL RECEIVE FOR 2; >> > ALTER CLUSTER SET REPLICATION CRITICAL FSYNC FOR 2; >> > ALTER CLUSTER SET REPLICATION CRITICAL REPLAY FOR 2; >> > >> > Or some such thing. I saw Heiiki's reply but really the idea that we are >> > shoving this all into the postgresql.conf is cumbersome. >> >> I think it should be a separate config file, and I think it should be >> a config file that can be edited using DDL commands as you propose. >> But it CAN'T be a system catalog, because, among other problems, that >> rules out cascading slaves, which are a feature a lot of people >> probably want to eventually have. > > I guarantee you there is a way around the cascade slave problem. And that would be...? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Sep 22, 2010 at 1:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >> On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: >>> But it CAN'T be a system catalog, because, among other problems, that >>> rules out cascading slaves, which are a feature a lot of people >>> probably want to eventually have. >> >> I guarantee you there is a way around the cascade slave problem. > And that would be...? Indeed. If it's a catalog then it has to be exactly the same on the master and every slave; which is probably a constraint we don't want for numerous reasons, not only cascade arrangements. regards, tom lane
On Wed, 2010-09-22 at 13:26 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Sep 22, 2010 at 1:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > >> On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > >>> But it CAN'T be a system catalog, because, among other problems, that > >>> rules out cascading slaves, which are a feature a lot of people > >>> probably want to eventually have. > >> > >> I guarantee you there is a way around the cascade slave problem. > > > And that would be...? > > Indeed. If it's a catalog then it has to be exactly the same on the > master and every slave; which is probably a constraint we don't want > for numerous reasons, not only cascade arrangements. Unless I am missing something the catalog only needs information for its specific cluster. E.g; My Master is, I am master for. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
"Joshua D. Drake" <jd@commandprompt.com> writes: > Unless I am missing something the catalog only needs information for its > specific cluster. E.g; My Master is, I am master for. I think the "cluster" here is composed of all and any server partaking into the replication network, whatever its role and cascading level, because we only support one master. As soon as the setup is replicated too, you can edit the setup from the one true master and from nowhere else, so the single authority must contain the whole setup. Now that doesn't mean all lines in the setup couldn't refer to a provider which could be different from the master in the case of cascading. What I don't understand is why the replication network topology can't get serialized into a catalog? Then again, assuming that a catalog ain't possible, I guess any file based setup will mean manual syncing of the whole setup at all the servers participating in the replication? If that's the case, I'll say it again, it looks like a nightmare to admin and I'd much prefer having a distributed setup, where any standby's setup is simple and directed to a single remote node, its provider. Please note also that such an arrangement doesn't preclude from having a way to register the standbys (automatically please) and requiring some action to enable the replication from their provider, and possibly from the master. But as there's already the hba to setup, I'd think paranoid sites are covered already. Regards, -- dim
All: I feel compelled to point out that, to date, there have been three times as many comments on what format the configuration file should be as there have been on what options it should support and how large numbers of replicas should be managed. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com P.S. You folks aren't imaginative enough. Tab-delimited files. Random dot images. Ogham!
Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > >> On Wed, Sep 22, 2010 at 1:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >> >>> On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: >>> >>>> But it CAN'T be a system catalog, because, among other problems, that >>>> rules out cascading slaves, which are a feature a lot of people >>>> probably want to eventually have. >>>> >>> I guarantee you there is a way around the cascade slave problem. >>> > > >> And that would be...? >> > > Indeed. If it's a catalog then it has to be exactly the same on the > master and every slave; which is probably a constraint we don't want > for numerous reasons, not only cascade arrangements. > It might be an idea to store the replication information outside of all clusters involved in the replication, to not depend on any failure of the master or any of the slaves. We've been using Apache's zookeeper http://hadoop.apache.org/zookeeper/ to keep track of configuration-like knowledge that must be distributed over a number of servers. While Zookeeper itself is probably not fit (java) to use in core Postgres to keep track of configuration information, what it provides seems like the perfect solution, especially group membership and a replicated directory-like database (with per directory node a value). regards, Yeb Havinga
On 22 September 2010 19:50, Josh Berkus <josh@agliodbs.com> wrote: > All: > > I feel compelled to point out that, to date, there have been three times > as many comments on what format the configuration file should be as > there have been on what options it should support and how large numbers > of replicas should be managed. I know, it's terrible!... I think it should be green. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On Wed, 2010-09-22 at 21:05 +0100, Thom Brown wrote: > On 22 September 2010 19:50, Josh Berkus <josh@agliodbs.com> wrote: > > All: > > > > I feel compelled to point out that, to date, there have been three times > > as many comments on what format the configuration file should be as > > there have been on what options it should support and how large numbers > > of replicas should be managed. > > I know, it's terrible!... I think it should be green. Remove the shadow please. > > -- > Thom Brown > Twitter: @darkixion > IRC (freenode): dark_ixion > Registered Linux user: #516935 > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
> The above case is one where I can see your point and it does sound > easier in that case. But I then think: "What happens after failover?". > We would then need to have 12 different standby.conf files, one on each > standby that describes what the setup would look like if that standby > became the master. And guess what, every time we made a change on the > master, you'd need to re-edit all 12 standby.conf files to reflect the > new configuration. So we're still back to having to edit in multiple > places, ISTM. Unless we can make the standby.conf files identical on all servers in the group. If we can do that, then conf file management utilities, fileshares, or a simple automated rsync could easily take care of things. But ... any setup which involves each standby being *required* to have a different configuration on each standby server, which has to be edited separately, is going to be fatally difficult to manage for anyone who has more than a couple of standbys. So I'd like to look at what it takes to get away from that. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Mon, 2010-09-20 at 18:24 -0400, Robert Haas wrote: > I feel like that's really nice and simple. There are already 5 separate places to configure to make streaming rep work in a 2 node cluster (master.pg_hba.conf, master.postgresql.conf, standby.postgresql.conf, standby.recovery.conf, password file/ssh key). I haven't heard anyone say we would be removing controls from those existing areas, so it isn't clear to me how adding a 6th place will make things "nice and simple". Put simply, Standby registration is not required for most use cases. If some people want it, I'm happy that it can be optional. Personally, I want to make very sure that any behaviour that involves waiting around indefinitely can be turned off and should be off by default. ISTM very simple to arrange things so you can set parameters on the master OR on the standby, whichever is most convenient or desirable. Passing parameters around at handshake is pretty trivial. I do also understand that some parameters *must* be set in certain locations to gain certain advantages. Those can be documented. I would be happier if we could separate the *list* of control parameters we need from the issue of *where* we set those parameters. I would be even happier if we could agree on the top 3-5 parameters so we can implement those first. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
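For reference, the five places mentioned here look roughly like the following in a minimal 9.0-style streaming-replication pair; the host names, user name and values are only examples:

    # 1. master pg_hba.conf
    host  replication  repuser  192.168.1.2/32  md5

    # 2. master postgresql.conf
    wal_level = hot_standby
    max_wal_senders = 1
    wal_keep_segments = 32

    # 3. standby postgresql.conf
    hot_standby = on

    # 4. standby recovery.conf
    standby_mode = 'on'
    primary_conninfo = 'host=192.168.1.1 port=5432 user=repuser'

    # 5. a ~/.pgpass entry (or ssh key) so the standby can authenticate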
On 23/09/10 11:34, Csaba Nagy wrote: > In the meantime our DBs are not able to keep in sync via WAL > replication, that would need some kind of parallel WAL restore on the > slave I guess, or I'm not able to configure it properly - in any case > now we use slony which is working. It would be interesting to debug that case a bit more. Was it bottlenecked by CPU or I/O, or network capacity perhaps? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi all, Some time ago I was also interested in this feature, and at that time I also thought about the possibility of doing the complete setup via postgres connections, meaning that the transfer of the files and all configuration/slave registration would be done through normal backend connections. In the meantime our DBs are not able to keep in sync via WAL replication, that would need some kind of parallel WAL restore on the slave I guess, or I'm not able to configure it properly - in any case now we use slony which is working. In fact the way slony does its configuration could be a good place to look... On Wed, 2010-09-22 at 13:16 -0400, Robert Haas wrote: > > I guarantee you there is a way around the cascade slave problem. > > And that would be...? * restrict the local file configuration to a replication ID; * make all configuration refer to the replica ID; * keep all configuration in a shared catalog: it can be kept exactly the same on all replicas, as each replication "node" will only care about the configuration concerning its own replica ID; * added advantage: after take-over the slave will change the configured master to its own replica ID, and if the old master would ever connect again, it could easily notice that and give up; Cheers, Csaba.
On Thu, 2010-09-23 at 12:02 +0300, Heikki Linnakangas wrote: > On 23/09/10 11:34, Csaba Nagy wrote: > > In the meantime our DBs are not able to keep in sync via WAL > > replication, that would need some kind of parallel WAL restore on the > > slave I guess, or I'm not able to configure it properly - in any case > > now we use slony which is working. > > It would be interesting to debug that case a bit more. Was it bottlenecked > by CPU or I/O, or network capacity perhaps? Unfortunately it was quite a long time ago that we last tried, and I don't remember exactly what was bottlenecked. Our application is quite write-intensive, the ratio of writes to reads that actually reach the disk is about 50-200% (according to the disk stats - yes, sometimes we write more to the disk than we read, probably due to the relatively large RAM installed). If I remember correctly, the standby was about the same as the master regarding IO/CPU power, but it was not able to process the WAL files as fast as they were coming in, which excludes at least the network as a bottleneck. What I actually suppose happens is that the one single process applying the WAL on the slave is not able to match the full IO the master is able to do with all its processors. If you're interested, I could set up another try, but it would be on 8.3.7 (that's what we still run). 9.x would also be interesting, but that would be a test system and I can't possibly reproduce there the load we have on production... Cheers, Csaba.
On 23/09/10 15:26, Csaba Nagy wrote: > Unfortunately it was quite long time ago we last tried, and I don't > remember exactly what was bottlenecked. Our application is quite > write-intensive, the ratio of writes to reads which actually reaches the > disk is about 50-200% (according to the disk stats - yes, sometimes we > write more to the disk than we read, probably due to the relatively > large RAM installed). If I remember correctly, the standby was about the > same regarding IO/CPU power as the master, but it was not able to > process the WAL files as fast as they were coming in, which excludes at > least the network as a bottleneck. What I actually suppose happens is > that the one single process applying the WAL on the slave is not able to > match the full IO the master is able to do with all it's processors. There's a program called pg_readahead somewhere on pgfoundry by NTT that will help if it's the single-threadedness of I/O. Before handing the WAL file to the server, it scans it through and calls posix_fadvise for all the blocks that it touches. When the server then replays it, the data blocks are already being fetched by the OS, using the whole RAID array. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > I think it should be a separate config file, and I think it should be > a config file that can be edited using DDL commands as you propose. > But it CAN'T be a system catalog, because, among other problems, that > rules out cascading slaves, which are a feature a lot of people > probably want to eventually have. ISTM that we can have a system catalog and still have cascading slaves. If we administer the catalog via the master, why can't we administer all slaves, however they cascade, via the master too? What other problems are there that mean we *must* have a file? I can't see any. Elsewhere, we've established that we can have unregistered standbys, so max_wal_senders cannot go away. If we do have a file, it will be a problem after failover since the file will be either absent or potentially out of date. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes: > ISTM that we can have a system catalog and still have cascading slaves. > If we administer the catalog via the master, why can't we administer all > slaves, however they cascade, via the master too? > What other problems are there that mean we *must* have a file? Well, for one thing, how do you add a new slave? If its configuration comes from a system catalog, it seems that it has to already be replicating before it knows what its configuration is. regards, tom lane
On Thu, 2010-09-23 at 11:43 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > ISTM that we can have a system catalog and still have cascading slaves. > > If we administer the catalog via the master, why can't we administer all > > slaves, however they cascade, via the master too? > > > What other problems are there that mean we *must* have a file? > > Well, for one thing, how do you add a new slave? If its configuration > comes from a system catalog, it seems that it has to already be > replicating before it knows what its configuration is. At the moment, I'm not aware of any proposed parameters that need to be passed from master to standby, since that was one of the arguments for standby registration in the first place. If that did occur, when the standby connects it would get told what parameters to use by the master as part of the handshake. It would have to work exactly that way with standby.conf on the master also. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Thu, Sep 23, 2010 at 11:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > >> I think it should be a separate config file, and I think it should be >> a config file that can be edited using DDL commands as you propose. >> But it CAN'T be a system catalog, because, among other problems, that >> rules out cascading slaves, which are a feature a lot of people >> probably want to eventually have. > > ISTM that we can have a system catalog and still have cascading slaves. > If we administer the catalog via the master, why can't we administer all > slaves, however they cascade, via the master too? Well, I guess we could, but is that really convenient? My gut feeling is no, but of course it's subjective. > What other problems are there that mean we *must* have a file? I can't > see any. Elsewhere, we've established that we can have unregistered > standbys, so max_wal_senders cannot go away. > > If we do have a file, it will be a problem after failover since the file > will be either absent or potentially out of date. I'm not sure about that. I wonder if we can actually turn this into a feature, with careful design. Suppose that you have the common configuration of two machines, A and B. At any given time, one is the master and one is the slave. And let's say you've opted for sync rep, apply mode, don't wait for disconnected standbys. Well, you can have a config file on A that defines B as the slave, and a config file on B that defines A as the slave. When failover happens, you still have to worry about taking a new base backup, removing recovery.conf from the new master and adding it to the slave, and all that stuff, but the standby config just works. Now, admittedly, in more complex topologies, and especially if you're using configuration options that pertain to the behavior of disconnected standbys (e.g. wait for them, or retain WAL for them), you're going to need to adjust the configs. But I think that's likely to be true anyway, even with a catalog. If A is doing sync rep and waiting for B even when B is disconnected, and the machines switch roles, it's hard to see how any configuration isn't going to need some adjustment. One thing that's nice about the flat file system is that you can make the configuration changes on the new master before you promote it (perhaps you had A replicating synchronously to B and B replicating asynchronously to C, but now that A is dead and B is promoted, you want the latter replication to become synchronous). Being able to make those kinds of changes before you start processing live transactions is possibly useful to some people. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
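A minimal sketch of that symmetric two-node arrangement, reusing the hypothetical dotted syntax from earlier in the thread: each machine's standby.conf names the other machine, so neither file has to change when the roles flip.

    # standby.conf on A
    B.host = b.example.com
    B.sync = on

    # standby.conf on B
    A.host = a.example.com
    A.sync = on

What does move at failover is the recovery.conf (and, where needed, a fresh base backup), as described above.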
Simon Riggs <simon@2ndQuadrant.com> writes: > On Thu, 2010-09-23 at 11:43 -0400, Tom Lane wrote: >> Well, for one thing, how do you add a new slave? If its configuration >> comes from a system catalog, it seems that it has to already be >> replicating before it knows what its configuration is. > At the moment, I'm not aware of any proposed parameters that need to be > passed from master to standby, since that was one of the arguments for > standby registration in the first place. > If that did occur, when the standby connects it would get told what > parameters to use by the master as part of the handshake. It would have > to work exactly that way with standby.conf on the master also. Um ... so how does this standby know what master to connect to, what password to offer, etc? I don't think that "pass down parameters after connecting" is likely to cover anything but a small subset of the configuration problem. regards, tom lane
On Thu, Sep 23, 2010 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Thu, 2010-09-23 at 11:43 -0400, Tom Lane wrote: >>> Well, for one thing, how do you add a new slave? If its configuration >>> comes from a system catalog, it seems that it has to already be >>> replicating before it knows what its configuration is. > >> At the moment, I'm not aware of any proposed parameters that need to be >> passed from master to standby, since that was one of the arguments for >> standby registration in the first place. > >> If that did occur, when the standby connects it would get told what >> parameters to use by the master as part of the handshake. It would have >> to work exactly that way with standby.conf on the master also. > > Um ... so how does this standby know what master to connect to, what > password to offer, etc? I don't think that "pass down parameters after > connecting" is likely to cover anything but a small subset of the > configuration problem. Huh? We have that stuff already. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > Now, admittedly, in more complex topologies, and especially if you're > using configuration options that pertain to the behavior of > disconnected standbys (e.g. wait for them, or retain WAL for them), > you're going to need to adjust the configs. But I think that's likely > to be true anyway, even with a catalog. If A is doing sync rep and > waiting for B even when B is disconnected, and the machines switch > roles, it's hard to see how any configuration isn't going to need some > adjustment. One thing that's nice about the flat file system is that > you can make the configuration changes on the new master before you > promote it Actually, that's the killer argument in this whole thing. If the configuration information is in a system catalog, you can't change it without the master being up and running. Let us suppose for example that you've configured hard synchronous replication such that the master can't commit without slave acks. Now your slaves are down and you'd like to change that setting. Guess what. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Sep 23, 2010 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Um ... so how does this standby know what master to connect to, what >> password to offer, etc? I don't think that "pass down parameters after >> connecting" is likely to cover anything but a small subset of the >> configuration problem. > Huh? We have that stuff already. Oh, I thought part of the objective here was to try to centralize that stuff. If we're assuming that slaves will still have local replication configuration files, then I think we should just add any necessary info to those files and drop this entire conversation. We're expending a tremendous amount of energy on something that won't make any real difference to the overall complexity of configuring a replication setup. AFAICS the only way you make a significant advance in usability is if you can centralize all the configuration information in some fashion. regards, tom lane
On Thu, Sep 23, 2010 at 1:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Sep 23, 2010 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Um ... so how does this standby know what master to connect to, what >>> password to offer, etc? I don't think that "pass down parameters after >>> connecting" is likely to cover anything but a small subset of the >>> configuration problem. > >> Huh? We have that stuff already. > > Oh, I thought part of the objective here was to try to centralize that > stuff. If we're assuming that slaves will still have local replication > configuration files, then I think we should just add any necessary info > to those files and drop this entire conversation. We're expending a > tremendous amount of energy on something that won't make any real > difference to the overall complexity of configuring a replication setup. > AFAICS the only way you make a significant advance in usability is if > you can centralize all the configuration information in some fashion. Well, it's quite fanciful to suppose that the slaves aren't going to need to have local configuration for how to connect to the master. The configuration settings we're talking about here are the things that affect either the behavior of the master-slave system as a unit (like what kind of ACK the master needs to get from the slave before ACKing the commit back to the user) or the master alone (like tracking how much WAL needs to be retained for a particular disconnected slave, rather than as presently always retaining a fixed amount). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Thu, 2010-09-23 at 16:18 +0300, Heikki Linnakangas wrote: > There's a program called pg_readahead somewhere on pgfoundry by NTT that > will help if it's the single-threadedness of I/O. Before handing the WAL > file to the server, it scans it through and calls posix_fadvise for all > the blocks that it touches. When the server then replays it, the data > blocks are already being fetched by the OS, using the whole RAID array. That sounds useful, thanks for the hint! But couldn't this also be built directly into the WAL recovery process? It would probably help a lot for recovering from a crash too. We recently had a crash and it took hours to recover. I will try it out as soon as I get the time to set it up... [searching pgfoundry] Unfortunately I can't find it, and google is also not very helpful. Do you happen to have some links to it? Cheers, Csaba.
On Thu, 2010-09-23 at 11:43 -0400, Tom Lane wrote: > > What other problems are there that mean we *must* have a file? > > Well, for one thing, how do you add a new slave? If its configuration > comes from a system catalog, it seems that it has to already be > replicating before it knows what its configuration is. Or the slave gets a connection string to the master, and reads the configuration from there - it has to connect there anyway... The ideal bootstrap for a slave creation would be: get the params to connect to the master + the replica ID, and the rest should be done by connecting to the master and getting all the needed thing from there, including configuration. Maybe you see some merit for this idea: it wouldn't hurt to get the interfaces done so that the master could be impersonated by some WAL repository serving a PITR snapshot, and that the same WAL repository could connect as a slave to the master and instead of recovering the WAL stream, archive it. Such a WAL repository would possibly connect to multiple masters and could also get regularly snapshots too. This would provide a nice complement to WAL replication as PITR solution using the same protocols as the WAL standby. I have no idea if this would be easy to implement or useful for anybody. Cheers, Csaba.
On 23/09/10 20:03, Tom Lane wrote: > Robert Haas<robertmhaas@gmail.com> writes: >> On Thu, Sep 23, 2010 at 12:52 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>> Um ... so how does this standby know what master to connect to, what >>> password to offer, etc? I don't think that "pass down parameters after >>> connecting" is likely to cover anything but a small subset of the >>> configuration problem. > >> Huh? We have that stuff already. > > Oh, I thought part of the objective here was to try to centralize that > stuff. If we're assuming that slaves will still have local replication > configuration files, then I think we should just add any necessary info > to those files and drop this entire conversation. We're expending a > tremendous amount of energy on something that won't make any real > difference to the overall complexity of configuring a replication setup. > AFAICS the only way you make a significant advance in usability is if > you can centralize all the configuration information in some fashion. If you want the behavior where the master doesn't acknowledge a commit to the client until the standby (or all standbys, or one of them etc.) acknowledges it, even if the standby is not currently connected, the master needs to know what standby servers exist. *That's* why synchronous replication needs a list of standby servers in the master. If you're willing to downgrade to a mode where commit waits for acknowledgment only from servers that are currently connected, then you don't need any new configuration files. But that's not what I call synchronous replication, it doesn't give you the guarantees that textbook synchronous replication does. (Gosh, I wish the terminology was more standardized in this area) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2010-09-23 at 13:07 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Now, admittedly, in more complex topologies, and especially if you're > > using configuration options that pertain to the behavior of > > disconnected standbys (e.g. wait for them, or retain WAL for them), > > you're going to need to adjust the configs. But I think that's likely > > to be true anyway, even with a catalog. If A is doing sync rep and > > waiting for B even when B is disconnected, and the machines switch > > roles, it's hard to see how any configuration isn't going to need some > > adjustment. Well, its not at all hard to see how that could be configured, because I already proposed a simple way of implementing parameters that doesn't suffer from those problems. My proposal did not give roles to named standbys and is symmetrical, so switchovers won't cause a problem. Earlier you argued that centralizing parameters would make this nice and simple. Now you're pointing out that we aren't centralizing this at all, and it won't be simple. We'll have to have a standby.conf set up that is customised in advance for each standby that might become a master. Plus we may even need multiple standby.confs in case that we have multiple nodes down. This is exactly what I was seeking to avoid and exactly what I meant when I asked for an analysis of the failure modes. This proposal is a configuration nightmare, no question, and that is not the right way to go if you want high availability that works when you need it to. > One thing that's nice about the flat file system is that > > you can make the configuration changes on the new master before you > > promote it > > Actually, that's the killer argument in this whole thing. If the > configuration information is in a system catalog, you can't change it > without the master being up and running. Let us suppose for example > that you've configured hard synchronous replication such that the master > can't commit without slave acks. Now your slaves are down and you'd > like to change that setting. Guess what. If we have standby registration and I respect that some people want it, a table seems to be the best place for them. In a table the parameters are passed through from master to slave automatically without needing to synchronize multiple files manually. They can only be changed on a master, true. But since they only effect the behaviour of a master (commits => writes) then that doesn't matter at all. As soon as you promote a new master you'll be able to change them again, if required. Configuration options that differ on each node, depending upon the current state of others nodes are best avoided. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Thu, Sep 23, 2010 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Well, its not at all hard to see how that could be configured, because I > already proposed a simple way of implementing parameters that doesn't > suffer from those problems. My proposal did not give roles to named > standbys and is symmetrical, so switchovers won't cause a problem. I know you proposed a way, but my angst is all around whether it was actually simple. I found it somewhat difficult to understand, so possibly other people might have the same problem. > Earlier you argued that centralizing parameters would make this nice and > simple. Now you're pointing out that we aren't centralizing this at all, > and it won't be simple. We'll have to have a standby.conf set up that is > customised in advance for each standby that might become a master. Plus > we may even need multiple standby.confs in case that we have multiple > nodes down. This is exactly what I was seeking to avoid and exactly what > I meant when I asked for an analysis of the failure modes. If you're operating on the notion that no reconfiguration will be necessary when nodes go down, then we have very different notions of what is realistic. I think that "copy the new standby.conf file in place" is going to be the least of the fine admin's problems. >> One thing that's nice about the flat file system is that >> > you can make the configuration changes on the new master before you >> > promote it >> >> Actually, that's the killer argument in this whole thing. If the >> configuration information is in a system catalog, you can't change it >> without the master being up and running. Let us suppose for example >> that you've configured hard synchronous replication such that the master >> can't commit without slave acks. Now your slaves are down and you'd >> like to change that setting. Guess what. > > If we have standby registration and I respect that some people want it, > a table seems to be the best place for them. In a table the parameters > are passed through from master to slave automatically without needing to > synchronize multiple files manually. > > They can only be changed on a master, true. But since they only effect > the behaviour of a master (commits => writes) then that doesn't matter > at all. As soon as you promote a new master you'll be able to change > them again, if required. Configuration options that differ on each node, > depending upon the current state of others nodes are best avoided. I think maybe you missed Tom's point, or else you just didn't respond to it. If the master is wedged because it is waiting for a standby, then you cannot commit transactions on the master. Therefore you cannot update the system catalog which you must update to unwedge it. Failing over in that situation is potentially a huge nuisance and extremely undesirable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-09-22 at 15:31 -0700, Josh Berkus wrote: > > The above case is one where I can see your point and it does sound > > easier in that case. But I then think: "What happens after failover?". > > We would then need to have 12 different standby.conf files, one on each > > standby that describes what the setup would look like if that standby > > became the master. And guess what, every time we made a change on the > > master, you'd need to re-edit all 12 standby.conf files to reflect the > > new configuration. So we're still back to having to edit in multiple > > places, ISTM. > > Unless we can make the standby.conf files identical on all servers in > the group. If we can do that, then conf file management utilities, > fileshares, or a simple automated rsync could easily take care of things. Would prefer table. > But ... any setup which involves each standby being *required* to have a > different configuration on each standby server, which has to be edited > separately, is going to be fatally difficult to manage for anyone who > has more than a couple of standbys. So I'd like to look at what it > takes to get away from that. Agreed. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-09-23 at 20:42 +0300, Heikki Linnakangas wrote: > If you want the behavior where the master doesn't acknowledge a > commit > to the client until the standby (or all standbys, or one of them > etc.) > acknowledges it, even if the standby is not currently connected, the > master needs to know what standby servers exist. *That's* why > synchronous replication needs a list of standby servers in the master. > > If you're willing to downgrade to a mode where commit waits for > acknowledgment only from servers that are currently connected, then > you don't need any new configuration files. As I keep pointing out, waiting for an acknowledgement from something that isn't there might just take a while. The only guarantee that provides is that you will wait a long time. Is my data more safe? No. To get zero data loss *and* continuous availability, you need two standbys offering sync rep and reply-to-first behaviour. You don't need standby registration to achieve that. > But that's not what I call synchronous replication, it doesn't give > you the guarantees that > textbook synchronous replication does. Which textbook? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
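A sketch of the configuration Simon describes, again with made-up parameter names: two synchronous standbys where the first acknowledgement releases the commit, so a single standby failure costs neither durability nor availability.

    standby1.sync = on
    standby2.sync = on
    sync_wait_for = first    # commit returns as soon as one standby has acknowledged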
On 09/23/2010 10:09 PM, Robert Haas wrote: > I think maybe you missed Tom's point, or else you just didn't respond > to it. If the master is wedged because it is waiting for a standby, > then you cannot commit transactions on the master. Therefore you > cannot update the system catalog which you must update to unwedge it. > Failing over in that situation is potentially a huge nuisance and > extremely undesirable. Well, Simon is arguing that there's no need to wait for a disconnected standby. So that's not much of an issue. Regards Markus Wanner
Simon, On 09/24/2010 12:11 AM, Simon Riggs wrote: > As I keep pointing out, waiting for an acknowledgement from something > that isn't there might just take a while. The only guarantee that > provides is that you will wait a long time. Is my data more safe? No. By now I agree that waiting for disconnected standbys is useless in master-slave replication. However, it makes me wonder where you draw the line between just temporarily unresponsive and disconnected. > To get zero data loss *and* continuous availability, you need two > standbys offering sync rep and reply-to-first behaviour. You don't need > standby registration to achieve that. Well, if your master reaches the false conclusion that both standbys are disconnected and happily continues without their ACKs (and the idiot admin being happy about having boosted database performance with whatever measure he recently took) you certainly no longer have a zero data loss guarantee. So for one, this needs a big fat warning that gets slapped on the admin's forehead in case of a disconnect. And second, the timeout for considering a standby to be disconnected should be large enough not to produce false negatives. IIUC the master still waits for an ACK during that timeout. An infinite timeout doesn't have either of these issues, because there's no such distinction between temporarily unresponsive and disconnected. Regards Markus Wanner
On 24/09/10 01:11, Simon Riggs wrote: >> But that's not what I call synchronous replication, it doesn't give >> you the guarantees that >> textbook synchronous replication does. > > Which textbook? I was using that word metaphorically, but for example: Wikipedia http://en.wikipedia.org/wiki/Replication_%28computer_science%29 (includes a caveat that many commercial systems skimp on it) Oracle docs http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repoverview.htm Scroll to "Synchronous Replication" Googling for "synchronous replication textbook" also turns up this actual textbook: Database Management Systems by R. Ramakrishnan & others which uses synchronous replication with this meaning, although in the context of multi-master replication. Interestingly, "Transaction Processing: Concepts and techniques" by Gray, Reuter, chapter 12.6.3, defines three levels: 1-safe - what we call asynchronous 2-safe - commit is acknowledged after the slave acknowledges it, but if the slave is down, fall back to asynchronous mode. 3-safe - commit is acknowledged only after slave acknowledges it. If it is down, refuse to commit In the context of multi-master replication, "eager replication" seems to be commonly used to mean synchronous replication. If we just want *something* that's useful, and want to avoid the hassle of registration and all that, I proposed a while back (http://archives.postgresql.org/message-id/4C7E29BC.3020902@enterprisedb.com) that we could aim for behavior that would be useful for distributing read-only load to slaves. The use case is specifically that you have one master and one or more hot standby servers. You also have something like pgpool that distributes all read-only queries across all the nodes, and routes updates to the master server. In this scenario, you want the master node not to acknowledge a commit to the client until all currently connected standby servers have replayed the commit. Furthermore, you want a standby server to stop accepting queries if it loses connection to the master, to avoid giving out-of-date responses. With suitable timeouts in the master and the standby, it seems possible to guarantee that you can connect to any node in the system and get an up-to-date result. It does not give zero data loss like synchronous replication does, but it keeps hot standby servers trustworthy for queries. It bothers me that no-one seems to have a clear use case in mind. People want "synchronous replication", but don't seem to care much what guarantees it should provide. I wish the terminology was better standardized in this area. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
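As a point of reference for that "trustworthy for queries" use case: the WAL positions involved can already be inspected by hand with the 9.0 functions, which is essentially what such a mode would automate. A minimal check, assuming one master and one standby:

    -- on the master: where WAL generation currently stands
    SELECT pg_current_xlog_location();

    -- on a standby: how much WAL it has received and how much it has replayed
    SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();

A pooler that only routes read-only queries to standbys whose replay location has reached the commit's location would get much the same effect, just without any help from the server.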
On 24/09/10 01:11, Simon Riggs wrote: > On Thu, 2010-09-23 at 20:42 +0300, Heikki Linnakangas wrote: >> If you want the behavior where the master doesn't acknowledge a >> commit >> to the client until the standby (or all standbys, or one of them >> etc.) >> acknowledges it, even if the standby is not currently connected, the >> master needs to know what standby servers exist. *That's* why >> synchronous replication needs a list of standby servers in the master. >> >> If you're willing to downgrade to a mode where commit waits for >> acknowledgment only from servers that are currently connected, then >> you don't need any new configuration files. > > As I keep pointing out, waiting for an acknowledgement from something > that isn't there might just take a while. The only guarantee that > provides is that you will wait a long time. Is my data more safe? No. It provides zero data loss, at the expense of availability. That's what synchronous replication is all about. > To get zero data loss *and* continuous availability, you need two > standbys offering sync rep and reply-to-first behaviour. Yes, that is a good point. I'm starting to understand what your proposal was all about. It makes sense when you think of a three node system configured for high availability with zero data loss like that. The use case of keeping hot standby servers up to date in a cluster where read-only queries are distributed across all nodes seems equally important though. What's the simplest method of configuration that supports both use cases? > You don't need standby registration to achieve that. Not necessarily I guess, but it creeps me out that a standby can just connect to the master and act as a synchronous slave, and there are no controls in the master on what standby servers there are. More complicated scenarios with quorums and different numbers of votes get increasingly hard to manage if there is no central place to configure them. But maybe we can ignore the more complicated setups for now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2010-09-23 at 14:26 +0200, Csaba Nagy wrote: > Unfortunately it was quite long time ago we last tried, and I don't > remember exactly what was bottlenecked. Our application is quite > write-intensive, the ratio of writes to reads which actually reaches > the disk is about 50-200% (according to the disk stats - yes, > sometimes we write more to the disk than we read, probably due to the > relatively large RAM installed). If I remember correctly, the standby > was about the same regarding IO/CPU power as the master, but it was > not able to process the WAL files as fast as they were coming in, > which excludes at least the network as a bottleneck. What I actually > suppose happens is that the one single process applying the WAL on the > slave is not able to match the full IO the master is able to do with > all its processors. > > If you're interested, I could try to set up another try, but it would > be on 8.3.7 (that's what we still run). On 9.x would be also > interesting... Substantial performance improvements came in 8.4 with the bgwriter running during recovery. That meant that the startup process didn't need to spend time doing restartpoints and could apply changes continuously. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-09-23 at 16:09 -0400, Robert Haas wrote: > On Thu, Sep 23, 2010 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Well, it's not at all hard to see how that could be configured, because I > > already proposed a simple way of implementing parameters that doesn't > > suffer from those problems. My proposal did not give roles to named > > standbys and is symmetrical, so switchovers won't cause a problem. > > I know you proposed a way, but my angst is all around whether it was > actually simple. I found it somewhat difficult to understand, so > possibly other people might have the same problem. Let's go back to Josh's 12 server example. This current proposal requires 12 separate and different configuration files each containing many parameters that require manual maintenance. I doubt that people looking at that objectively will decide that is the best approach. We need to arrange a clear way for people to decide for themselves. I'll work on that. > > Earlier you argued that centralizing parameters would make this nice and > > simple. Now you're pointing out that we aren't centralizing this at all, > > and it won't be simple. We'll have to have a standby.conf set up that is > > customised in advance for each standby that might become a master. Plus > > we may even need multiple standby.confs in case that we have multiple > > nodes down. This is exactly what I was seeking to avoid and exactly what > > I meant when I asked for an analysis of the failure modes. > > If you're operating on the notion that no reconfiguration will be > necessary when nodes go down, then we have very different notions of > what is realistic. I think that "copy the new standby.conf file in > place" is going to be the least of the fine admin's problems. Earlier you argued that setting parameters on each standby was difficult and we should centralize things on the master. Now you tell us that actually we do need lots of settings on each standby and that to think otherwise is not realistic. That's a contradiction. The chain of argument used to support this as being a sensible design choice is broken or contradictory in more than one place. I think we should be looking for a design using the KISS principle, while retaining sensible tuning options. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Tom Lane <tgl@sss.pgh.pa.us> writes: > Oh, I thought part of the objective here was to try to centralize that > stuff. If we're assuming that slaves will still have local replication > configuration files, then I think we should just add any necessary info > to those files and drop this entire conversation. We're expending a > tremendous amount of energy on something that won't make any real > difference to the overall complexity of configuring a replication setup. > AFAICS the only way you make a significant advance in usability is if > you can centralize all the configuration information in some fashion. +1, but for real usability you have to make it so that this central setup can be edited from any member of the replication group. HINT: plproxy. Regards, -- dim
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > If you want the behavior where the master doesn't acknowledge a commit to > the client until the standby (or all standbys, or one of them etc.) > acknowledges it, even if the standby is not currently connected, the master > needs to know what standby servers exist. *That's* why synchronous > replication needs a list of standby servers in the master. And this list can be maintained in a semi-automatic fashion: - adding to the list is done by the master as soon as a standby connects; maybe we need to add a notion of "fqdn" in the standby setup? - service level and current weight and any other knob that comes from the standby are changed on the fly by the master if that changes on the standby (default async, 1, but SIGHUP please) - the current standby position (LSN for recv, fsync and replayed) of the standby, as received in the "feedback loop", is changed on the fly by the master - removing a standby has to be done manually, using an admin function; that's the only way to sort out permanent vs transient unavailability - checking the current values in this list is done on the master by using some system view based on a SRF, as already said Regards, -- dim
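As a sketch of how that semi-automatic list could surface to the DBA, it might look something like the following; the view name, its columns, and the admin function are all invented for illustration and do not exist today:

    -- hypothetical system view, one row per known standby
    SELECT standby_name, fqdn, service_level, weight,
           received_lsn, fsynced_lsn, replayed_lsn, connected
      FROM pg_standby_list;

    -- removal stays a deliberate, manual act
    SELECT pg_drop_standby('reporting-standby-2');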
On Fri, 2010-09-24 at 11:08 +0300, Heikki Linnakangas wrote: > On 24/09/10 01:11, Simon Riggs wrote: > >> But that's not what I call synchronous replication, it doesn't give > >> you the guarantees that > >> textbook synchronous replication does. > > > > Which textbook? > > I was using that word metaphorically, but for example: > > Wikipedia > http://en.wikipedia.org/wiki/Replication_%28computer_science%29 > (includes a caveat that many commercial systems skimp on it) Yes, I read that. The example it uses shows only one standby, which does suffer from the problem/caveat it describes. Two standbys resolve that problem, yet there is no mention of multiple standbys in Wikipedia. > Oracle docs > > http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repoverview.htm > Scroll to "Synchronous Replication" That document refers to sync rep *only* in the context of multimaster replication. We aren't discussing that here and so that link is not relevant at all. Oracle Data Guard in Maximum Availability mode is roughly where I think we should be aiming: http://download.oracle.com/docs/cd/B10500_01/server.920/a96653/concepts.htm#1033871 But I disagree with consulting other companies' copyrighted material, and I definitely don't like their overcomplicated configuration. And they have not yet thought of per-transaction controls. So I believe we should learn many lessons from them, but actually ignore and surpass them. Easily. > Googling for "synchronous replication textbook" also turns up this > actual textbook: > Database Management Systems by R. Ramakrishnan & others > which uses synchronous replication with this meaning, although in the > context of multi-master replication. > > Interestingly, "Transaction Processing: Concepts and techniques" by > Gray, Reuter, chapter 12.6.3, defines three levels: > > 1-safe - what we call asynchronous > 2-safe - commit is acknowledged after the slave acknowledges it, but if > the slave is down, fall back to asynchronous mode. > 3-safe - commit is acknowledged only after slave acknowledges it. If it > is down, refuse to commit Which again is a one-standby viewpoint on the problem. Wikipedia is right that there is a problem when using just one server. "3-safe" mode is not more safe than "2-safe" mode when you have 2 standbys. If you want high availability you need N+1 redundancy. If you want a standby server, that is N=1. If you want a highly available standby configuration then N+1 = 2. Show me the textbook that describes what happens with 2 standbys. If one exists, I'm certain it would agree with my analysis. (I'll read and comment on your other points later today.) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2010-09-24 at 11:43 +0300, Heikki Linnakangas wrote: > > To get zero data loss *and* continuous availability, you need two > > standbys offering sync rep and reply-to-first behaviour. > > Yes, that is a good point. > > I'm starting to understand what your proposal was all about. It makes > sense when you think of a three node system configured for high > availability with zero data loss like that. > > The use case of keeping hot standby servers up to date in a cluster > where > read-only queries are distributed across all nodes seems equally > important though. What's the simplest method of configuration that > supports both use cases? That is definitely the right question. (More later) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Robert Haas <robertmhaas@gmail.com> writes: > I think maybe you missed Tom's point, or else you just didn't respond > to it. If the master is wedged because it is waiting for a standby, > then you cannot commit transactions on the master. Therefore you > cannot update the system catalog which you must update to unwedge it. > Failing over in that situation is potentially a huge nuisance and > extremely undesirable. All Wrong. You might remember that Simon's proposal begins with per-transaction synchronous replication behavior? Regards, -- dim
On 24/09/10 13:57, Simon Riggs wrote: > If you want high availability you need N+1 redundancy. If you want a > standby server that is N=1. If you want a highly available standby > configuration then N+1 = 2. Yep. Synchronous replication with one standby gives you zero data loss. When you add a 2nd standby as you described, then you have a reasonable level of high availability as well, as you can continue processing transactions in the master even if one slave dies. > Show me the textbook that describes what happens with 2 standbys. If one > exists, I'm certain it would agree with my analysis. I don't disagree with your analysis about multiple standbys and high availability. What I'm saying is that in a two standby situation, if you're willing to continue operation as usual in the master even if the standby is down, you're not doing synchronous replication. Extending that to a two standby situation, my claim is that if you're willing to continue operation as usual in the master when both standbys are down, you're not doing synchronous replication. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2010-09-24 at 14:12 +0300, Heikki Linnakangas wrote: > What I'm saying is that in a two standby situation, if > you're willing to continue operation as usual in the master even if > the standby is down, you're not doing synchronous replication. Oracle and I disagree with you on that point, but I am more interested in behaviour than semantics. If you have two standbys and one is down, please explain how data loss has occurred. > Extending that to a two standby situation, my claim is that if you're > willing to continue operation as usual in the master when both > standbys are down, you're not doing synchronous replication. Agreed. But you still need to decide how you will act. I choose pragmatism in that case. Others have voiced that they would like the database to shut down or have all sessions hang. I personally doubt their employers would feel the same way. Arguing technical correctness would seem unlikely to allow a DBA to keep their job if they stood and watched the app become unavailable. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Sep 24, 2010 at 6:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> > Earlier you argued that centralizing parameters would make this nice and >> > simple. Now you're pointing out that we aren't centralizing this at all, >> > and it won't be simple. We'll have to have a standby.conf set up that is >> > customised in advance for each standby that might become a master. Plus >> > we may even need multiple standby.confs in case that we have multiple >> > nodes down. This is exactly what I was seeking to avoid and exactly what >> > I meant when I asked for an analysis of the failure modes. >> >> If you're operating on the notion that no reconfiguration will be >> necessary when nodes go down, then we have very different notions of >> what is realistic. I think that "copy the new standby.conf file in >> place" is going to be the least of the fine admin's problems. > > Earlier you argued that setting parameters on each standby was difficult > and we should centralize things on the master. Now you tell us that > actually we do need lots of settings on each standby and that to think > otherwise is not realistic. That's a contradiction. You've repeatedly accused me and others of contradicting ourselves. I don't think that's helpful in advancing the debate, and I don't think it's what I'm doing. The point I'm trying to make is that when failover happens, lots of reconfiguration is going to be needed. There is just no getting around that. Let's ignore synchronous replication entirely for a moment. You're running 9.0 and you have 10 slaves. The master dies. You promote a slave. Guess what? You need to look at each slave you didn't promote and adjust primary_conninfo. You also need to check whether the slave has received an xlog record with a higher LSN than the one you promoted. If it has, you need to take a new base backup. Otherwise, you may have data corruption - very possibly silent data corruption. Do you dispute this? If so, on which point? The reason I think that we should centralize parameters on the master is because they affect *the behavior of the master*. Controlling whether the master will wait for the slave on the slave strikes me (and others) as spooky action at a distance. Configuring whether the master will retain WAL for a disconnected slave on the slave is outright byzantine. Of course, configuring these parameters on the master means that when the master changes, you're going to need a configuration (possibly the same, possibly different) for said parameters on the new master. But since you may be doing a lot of other adjustment at that point anyway (e.g. new base backups, changes in the set of synchronous slaves) that doesn't seem like a big deal. > The chain of argument used to support this as being a sensible design choice is broken or contradictory in more than one > place. I think we should be looking for a design using the KISS principle, while retaining sensible tuning options. The KISS principle is exactly what I am attempting to apply. Configuring parameters that affect the master on some machine other than the master isn't KISS, to me. You may find that broken or contradictory, but I disagree. I am attempting to disagree respectfully, but statements like the above make me feel like you're flaming, and that's getting under my skin. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
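To put the failover chore in concrete terms, the per-standby adjustment being described looks roughly like this on 9.0; the host names and paths are placeholders, but the parameters and the function are the existing ones:

    # recovery.conf on each standby that was NOT promoted:
    # repoint it at the new master and restart the standby
    standby_mode     = 'on'
    primary_conninfo = 'host=new-master.example.com port=5432 user=replication'
    trigger_file     = '/var/lib/pgsql/9.0/failover.trigger'

    -- and, per standby, check how far it had received WAL before the switch:
    SELECT pg_last_xlog_receive_location();

If that location is ahead of the point at which the promoted node took over, that standby needs a fresh base backup, which is exactly the silent-corruption risk described above.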
On Fri, Sep 24, 2010 at 7:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Fri, 2010-09-24 at 14:12 +0300, Heikki Linnakangas wrote: >> What I'm saying is that in a two standby situation, if >> you're willing to continue operation as usual in the master even if >> the standby is down, you're not doing synchronous replication. > > Oracle and I disagree with you on that point, but I am more interested > in behaviour than semantics. I *think* he meant s/two standby/two server/. That's taken from the two references: *the* master, *the* slave. In that case, if the master is committing w/ no slave connected, it *isn't* replication, synchronous or not. Useful, likely, but not replication at that PIT. > If you have two standbys and one is down, please explain how data loss > has occurred. Right, of course. But I was thinking he meant 2 servers (1 standby), not 3 servers (2 standbys). But even with only 2 servers, if the standby is down and the master is up, there isn't data loss. There's *potential* for data loss. > But you still need to decide how you will act. I choose pragmatism in > that case. > > Others have voiced that they would like the database to shut down or have > all sessions hang. I personally doubt their employers would feel the > same way. Arguing technical correctness would seem unlikely to allow a > DBA to keep their job if they stood and watched the app become > unavailable. Again, it all depends on the business. Synchronous replication can give you two things: 1) High Availability (Just answer my queries, dammit!) 2) High Durability (Don't give me an answer unless you're damn well sure it's the right one) and its goal is to do that in the face of "catastrophic failure" (for some level of catastrophic). It's the trade-off between: 1) The cost of delaying/refusing transactions being greater than the potential cost of a lost transaction 2) The cost of a lost transaction being greater than the cost of delaying/refusing transactions So there are people who want to use PostgreSQL in a situation where they'd much rather not "say" they have done something unless they are sure it's safely written in 2 different systems, in 2 different locations (and yes, the distance between those two locations will be a trade-off wrt performance, and the business will need to decide on their risk levels). I understand it's not optimal, desirable, or even practical for the vast majority of cases. I don't want it to be impossible, or, if it's decided that it will be impossible, hopefully not just because you decided nobody ever needs it, but because it's not feasible due to code/implementation complexities ;-)
On 24/09/10 14:47, Simon Riggs wrote: > On Fri, 2010-09-24 at 14:12 +0300, Heikki Linnakangas wrote: >> What I'm saying is that in a two standby situation, if >> you're willing to continue operation as usual in the master even if >> the standby is down, you're not doing synchronous replication. > > Oracle and I disagree with you on that point, but I am more interested > in behaviour than semantics. > > If you have two standbys and one is down, please explain how data loss > has occurred. Sorry, that was a typo. As Aidan guessed, I meant "even in a two server situation", ie. one master and one slave. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi, I'm defending my ideas here so that they don't get put in the bag you're wanting to put away. We have more than 2 proposals lying around here. I'm one of the guys with a proposal and no code, but still trying to be clear. Robert Haas <robertmhaas@gmail.com> writes: > The reason I think that we should centralize parameters on the master > is because they affect *the behavior of the master*. Controlling > whether the master will wait for the slave on the slave strikes me > (and others) as spooky action at a distance. I hope it's clear that I didn't propose anything like this in the related threads. What you set up on the slave is related only to what the slave has to offer to the master. What happens on the master wrt waiting etc. is set up on the master, and is controlled per-transaction. As my ideas come in good part from understanding Simon's work and proposal, my feeling is that stating them here will help the thread. > Configuring whether the > master will retain WAL for a disconnected slave on the slave is > outright byzantine. Again, I can't remember having proposed such a thing. > Of course, configuring these parameters on the > master means that when the master changes, you're going to need a > configuration (possibly the same, possibly different) for said > parameters on the new master. But since you may be doing a lot of > other adjustment at that point anyway (e.g. new base backups, changes > in the set of synchronous slaves) that doesn't seem like a big deal. Should we take some time and define the behaviors we expect in the cluster, and the ones we want to provide for each error case we can think of, we'd be able to define the set of parameters that we need to operate the system. Then, some of us are betting that it will be possible to accommodate either a single central setup that you edit in only one place at failover time, *or* that the best way to manage the setup is to have it distributed. Granted, given how it currently works, it looks like you will have to edit the primary_conninfo on a bunch of standbys at failover time, for example. I'd like that we now follow Josh Berkus (and some other) advice now, and start a new thread to decide what we mean by synchronous replication, what kind of normal behaviour we want and what responses to errors we expect to be able to deal with in what (optional) ways. The longer we stay on this thread, the clearer it becomes that no two of us are talking about the same synchronous replication feature set. Regards, -- dim
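For what it's worth, "controlled per-transaction" means roughly the following from the application's side. The parameter name and values below are placeholders invented for illustration; the actual proposal may spell this quite differently:

    -- hypothetical per-transaction control, for illustration only
    BEGIN;
    SET LOCAL synchronous_replication = on;   -- this commit waits for a standby ack
    UPDATE accounts SET balance = balance - 100 WHERE id = 42;
    COMMIT;

    BEGIN;
    SET LOCAL synchronous_replication = off;  -- bulk work that is happy to be async
    COPY audit_log FROM '/tmp/batch.csv';
    COMMIT;

The point is that the durability level is chosen by the transaction that needs it, while the slave-side settings only describe what each standby has to offer.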
On Fri, 2010-09-24 at 16:01 +0200, Dimitri Fontaine wrote: > I'd like that we now follow Josh Berkus (and some other) advice now, and > start a new thread to decide what we mean by synchronous replication, > what kind of normal behaviour we want and what responses to errors we > expect to be able to deal with in what (optional) ways. What I intend to do from here is make a list of all desired use cases, then ask for people to propose ways of configuring those. Hopefully we don't need to discuss the meaning of the phrase "sync rep", we just need to look at the use cases. That way we will be able to directly compare the flexibility/complexity/benefits of configuration between different proposals. I think this will allow us to rapidly converge on something useful. If multiple solutions exist, we may then be able to decide/vote on a prioritisation of use cases to help resolve any difficulty. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 24/09/10 17:13, Simon Riggs wrote: > On Fri, 2010-09-24 at 16:01 +0200, Dimitri Fontaine wrote: > >> I'd like that we now follow Josh Berkus (and some other) advice now, and >> start a new thread to decide what we mean by synchronous replication, >> what kind of normal behaviour we want and what responses to errors we >> expect to be able to deal with in what (optional) ways. > > What I intend to do from here is make a list of all desired use cases, > then ask for people to propose ways of configuring those. Hopefully we > don't need to discuss the meaning of the phrase "sync rep", we just need > to look at the use cases. Yes, that seems like a good way forward. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Sep 24, 2010 at 10:01 AM, Dimitri Fontaine <dfontaine@hi-media.com> wrote: >> Configuring whether the >> master will retain WAL for a disconnected slave on the slave is >> outright byzantine. > > Again, I can't remember having proposed such a thing. No one has, but I keep hearing we don't need the master to have a list of standbys and a list of properties for each standby... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company