Thread: Hot Standby query cancellation and Streaming Replication integration
I'm happy to see we've crossed the point where the worst of the Hot Standby and Streaming Replication issues are sorted out. A look at the to-do lists:

http://wiki.postgresql.org/wiki/Hot_Standby_TODO
http://wiki.postgresql.org/wiki/Streaming_Replication

shows no Must-fix items and 5 Serious Issues left for Hot Standby; there are 9 Streaming Replication items there, which aren't as clearly prioritized yet. Correct me if I'm wrong here, but those read to me like tweaks and polishing rather than major architecture issues at this point, so I believe that code is in the same position as HS: some serious issues, but no really terrible parts.

The job Simon asked me to take a look at starting last week is which of the listed HS "Serious Issues" might be promoted into must-fix items after seeing how easy they were to encounter. There are a number of HS tunables that interact with one another, and depending on your priorities there are a few ways you can try to optimize the configuration for what I expect to be common use cases for this feature. I've written a blog entry at http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html that tries to explain all that background clearly and relate the implementation details to how I expect DBAs will perceive them. That was a bit much to also cover here, and it has a broader audience that might appreciate it than just this list.

Attached is a tar file with some test case demo scripts that demonstrate the worst of the problems here IMHO. A README in there outlines how to set the problem demo up (presuming you've already gotten a HS pair going). What this does is execute the following sequence continuously on the master (a condensed sketch of this workload also appears below, after the attachment note):

UPDATE pgbench_tellers SET tbalance = tbalance + <delta> WHERE tid = <tid>; (several times)
VACUUM pgbench_tellers;

Meanwhile, on the standby, the following long query runs on a few sessions at once, again looping constantly:

SELECT sum(abalance) FROM pgbench_accounts;

It took a bit of testing to get the database scale and iteration times to easily encounter the issue here on my system; I hope this shows up easily enough for others with the values used. (I have a similar work-in-progress demo that tries to trigger the b-tree deletion problem too; I'll follow up once the storm of messages about this topic dies down, as I think this one is a prerequisite for it anyway.)

I'm not sure what you might be expecting from the above combination, but what actually happens is that many of the SELECT statements on the table *that isn't even being updated* are canceled. You see this in the logs:

LOG: restored log file "0000000100000000000000A5" from archive
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
STATEMENT: SELECT sum(abalance) FROM pgbench_accounts;

Basically, every time a WAL segment appears that wipes out a tuple that SELECT expects should still be visible, because the dead row left behind by the update has been vacuumed away, the query is canceled. This happens all the time the way I've set this up, and I don't feel like this is a contrived demo. Having a long-running query on the standby while things get updated and then periodically autovacuumed on the primary is going to be extremely common in the sorts of production systems I expect to want HS the most.

Now, as explained in the blog entry and in the documentation, there are all sorts of ways you can work around this issue by tweaking parameters or doing fun tricks with dblink.
You can prioritize any two of keeping the standby current, letting long-running queries execute on the standby, and keeping xid advances on the master moving forward as fast as possible. But you can't get all three at once. The choices available are really about the best you can do given a system that's basically the old warm-standby approach, improved with adding just Hot Standby to the mix. Sure, you might make the conflict resolution a bit smarter or make the UI for setting the parameters more friendly, and there's already been plenty of argument and patching over all of that. I don't want to belittle that work because it's been important to make HS a useful standalone feature, but I feel like that's all triage rather than looking for the most robust fix possible. If you're running a system that also is using Streaming Replication, there is a much better approach possible. This idea has been floating around for a while and I am not taking credit for inventing it (too busy tonight to dig into the archives to figure out exactly when this popped up initially and who deserves credit for it). I'm just pointing out that now is the time where it's actually possible to implement. The HS TODO already includes the following action item, to resolve a serious issue you can run into (that itself would be great to eliminate): "Requires keep-alives with timestamps to be added to sync rep feature" If those keep-alives flowed in both directions, and included both timestamps *and* xid visibility information, the master could easily be configured to hold open xid snapshots needed for long running queries on the standby when that was necessary. I might be missing an implementation detail here, but from a high level it seems like you could make the walreceiver on the master publish the information about where the standby has advanced to as a bit of ProcArray xmin data. Then the master could only advance past where the standby says it cannot need visibility behind anymore. This is a much more elegant solution than any of the hacks available so far. It would turn Hot Standby + Streaming Replication into a system that stepped out of the way of the worst of the technical limitations of HS alone. The master would easily and automatically avoid advancing past where the queries running on the standby needed visibility back to, essentially the same way cleanup is blocked during a long-running query on the primary--except with the actual main query work offloaded to the standby, the idea all along. I don't know how difficult the keepalive feature was expected to be, and there's certainly plenty of potential landmines in this whole xid export idea. How to handle situations where the standby goes away for a while, such as a network outage, so that it doesn't block the master from ever cleaning up dead tuples is a concern. I wouldn't expect that to be too serious of a blocker, given that if the standby isn't talking it probably isn't running queries you need to worry about canceling either. Not sure where else this can fall down, and unfortunately I don't know nearly enough about the SR code to help myself with implementing this feature. (I think Simon is in a similar position--it's just not what we've been staring at the last few months). But I do know that the current Hot Standby implementation is going to be frustrating to configure correctly for people. 
If it's possible to make most of that go away just by doing some final integration between it and Streaming Replication that just wasn't practical to accomplish until now, I think it's worth considering how to make that happen before the final 9.0 release.

I really hope this discussion can stay focused on if and how it's possible to improve this area, with the goal being to deliver a product everyone can be proud of, with the full feature set that makes this next release a killer one. The features that have all managed to get into this release already are fantastic, everyone who contributed should be proud of that progress, and it's encouraging that the alpha4 date was nailed. It would be easy to descend into finger-pointing over why exactly this particular problem is only getting more visibility now, or into schedule-oriented commentary suggesting it must be ignored because it's too late to do anything about it. I hope everyone appreciates that wandering that way will not help make PostgreSQL 9.0 a better release. This issue is so easy to encounter, and looks so bad when it happens, that I feel it could easily lead to an embarrassing situation for the community if something isn't done about it before release.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Attachment
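For readers without the attachment, here is a minimal sketch of the workload the demo scripts drive, pieced together from the description above. The table names come from a standard pgbench initialization; the loop structure and the delta/tid values are assumptions, not the exact contents of the tar file.

  -- On the master, repeated continuously by the demo driver:
  UPDATE pgbench_tellers SET tbalance = tbalance + 10 WHERE tid = 1;  -- several such updates
  VACUUM pgbench_tellers;

  -- Meanwhile on the standby, in a few concurrent sessions, also looping:
  SELECT sum(abalance) FROM pgbench_accounts;

Once a VACUUM on the master removes tuple versions that one of the standby SELECTs could still need, replaying that cleanup produces the "canceling statement due to conflict with recovery" error shown in the log excerpt.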
On Fri, Feb 26, 2010 at 8:33 AM, Greg Smith <greg@2ndquadrant.com> wrote: > > I'm not sure what you might be expecting from the above combination, but > what actually happens is that many of the SELECT statements on the table > *that isn't even being updated* are canceled. You see this in the logs: Well I proposed that the default should be to wait forever when applying WAL logs that conflict with a query. Precisely because I think the expectation is that things will "just work" and queries not fail unpredictably. Perhaps in your test a larger max_standby_delay would have prevented the cancellations but then as soon as you try a query which lasts longer you would have to raise it again. There's no safe value which will be right for everyone. > If you're running a system that also is using Streaming Replication, there > is a much better approach possible. So I think one of the main advantages of a log shipping system over the trigger-based systems is precisely that it doesn't require the master to do anything it wasn't doing already. There's nothing the slave can do which can interfere with the master's normal operation. This independence is really a huge feature. It means you can allow users on the slave that you would never let near the master. The master can continue running production query traffic while users run all kinds of crazy queries on the slave and drive it into the ground and the master will continue on blithely unaware that anything's changed. In the model you describe any long-lived queries on the slave cause tables in the master to bloat with dead records. I think this model is on the roadmap but it's not appropriate for everyone and I think one of the benefits of having delayed it is that it forces us to get the independent model right before throwing in extra complications. It would be too easy to rely on the slave feedback as an answer for hard questions about usability if we had it and just ignore the question of what to do when it's not the right solution for the user. -- greg
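For reference, the knob being argued about here is a single standby-side setting. A hedged sketch of the three postures it can express, assuming the 9.0-era behaviour where -1 means wait forever; the concrete values are illustrations, not recommendations:

  # postgresql.conf on the standby
  max_standby_delay = -1      # Greg Stark's proposed default: replay waits for queries indefinitely
  #max_standby_delay = 0      # the opposite: cancel conflicting queries immediately, keep replay current
  #max_standby_delay = 30s    # a compromise: give queries up to 30 seconds before cancelling

Whichever value is picked, it only trades query cancellations against replay lag; it cannot remove the conflict itself, which is what the rest of the thread is about.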
Re: Hot Standby query cancellation and Streaming Replication integration
From: Heikki Linnakangas
Greg Smith wrote:
> Attached is a tar file with some test case demo scripts that demonstrate
> the worst of the problems here IMHO.

Thanks for that! We've been discussing this for ages, so it's nice to have a concrete example.

> I don't want to belittle that work because it's been important to make
> HS a useful standalone feature, but I feel like that's all triage rather
> than looking for the most robust fix possible.

Ideally the standby would stash away the old pages or tuples somewhere so that it can still access them even after replaying the WAL records that remove them from the main storage. I realize that's not going to happen any time soon because it's hard to do, but that would really be the most robust fix possible.

> I don't know how difficult the keepalive feature was expected to be, and
> there's certainly plenty of potential landmines in this whole xid export
> idea.

One such landmine is that the keepalives need to flow from client to server while the WAL records are flowing from server to client. We'll have to crack that problem for synchronous replication too, but I think that alone is a big enough problem to make this 9.1 material.

> How to handle situations where the standby goes away for a while,
> such as a network outage, so that it doesn't block the master from ever
> cleaning up dead tuples is a concern.

Yeah, that's another issue that needs to be dealt with. You'd probably need some kind of a configurable escape valve in the master, to let it ignore a standby's snapshot once it gets too old.

> But I do know that the current Hot Standby implementation is going to be
> frustrating to configure correctly for people.

Perhaps others who are not as deep into the code as I am will have a better view on this, but I seriously don't think that's such a big issue. I think the max_standby_delay setting is quite intuitive and easy to explain. Sure, it would be better if there was no tradeoff between killing queries and stalling recovery, but I don't think it'll be that hard to understand the tradeoff.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 26/02/10 08:33, Greg Smith wrote:
> There are a number of HS tunables that interact with one another, and
> depending on your priorities there are a few ways you can try to optimize
> the configuration for what I expect to be common use cases for this feature.
> I've written a blog entry at
> http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html
> that tries to explain all that background clearly,

It did too. Thanks for the nice summary people can be pointed at.

> I'm not sure what you might be expecting from the above combination, but
> what actually happens is that many of the SELECT statements on the table
> *that isn't even being updated* are canceled. You see this in the logs:

Hmm - this I'd already figured out for myself. It's just occurred to me that this could well be the case between databases too. Database A gets vacuumed, B gets its queries kicked off on the standby. Granted lots of people just have the one main DB, but even so...

> LOG: restored log file "0000000100000000000000A5" from archive
> ERROR: canceling statement due to conflict with recovery
> DETAIL: User query might have needed to see row versions that must be
> removed.
> STATEMENT: SELECT sum(abalance) FROM pgbench_accounts;
>
> Basically, every time a WAL segment appears that wipes out a tuple that
> SELECT expects should still be visible, because the dead row left behind
> by the update has been vacuumed away, the query is canceled. This
> happens all the time the way I've set this up, and I don't feel like
> this is a contrived demo. Having a long-running query on the standby
> while things get updated and then periodically autovacuumed on the
> primary is going to be extremely common in the sorts of production
> systems I expect to want HS the most.

I can see pretty much everyone wanting HS+SR. Thousands of small DBs running on VMs for a start. Free mostly-live backup? Got to be a winner.

Dumb non-hacker question: why do we cancel all transactions rather than just those with "ACCESS SHARE" on the vacuumed table in question? Is it the simple fact that we don't know what table this particular section of WAL affects, or is it the complexity of tracking all this info?

> If you're running a system that also is using Streaming Replication,
> there is a much better approach possible.
> "Requires keep-alives with timestamps to be added to sync rep feature"
>
> If those keep-alives flowed in both directions, and included both
> timestamps *and* xid visibility information, the master could easily be
> configured to hold open xid snapshots needed for long running queries on
> the standby when that was necessary.

Presumably meaning we need *another* config setting to prevent excessive bloat on a heavily updated table on the master.

--
Richard Huxton
Archonet Ltd
On 26/02/10 14:10, Heikki Linnakangas wrote: > > Ideally the standby would stash away the old pages or tuples somewhere > so that it can still access them even after replaying the WAL records > that remove them from the main storage. I realize that's not going to > happen any time soon because it's hard to do, but that would really be > the most robust fix possible. Something like snapshotting a filesystem, so updates continue while you're still looking at a static version. -- Richard Huxton Archonet Ltd
Re: Hot Standby query cancellation and Streaming Replication integration
From: Heikki Linnakangas
Richard Huxton wrote: > On 26/02/10 08:33, Greg Smith wrote: >> I'm not sure what you might be expecting from the above combination, but >> what actually happens is that many of the SELECT statements on the table >> *that isn't even being updated* are canceled. You see this in the logs: > > Hmm - this I'd already figured out for myself. It's just occurred to me > that this could well be the case between databases too. Database A gets > vacuumed, B gets its queries kicked off on the standby. No, it's per-database already. Only queries in the same database are canceled. > Dumb non-hacker question: why do we cancel all transactions rather than > just those with "ACCESS SHARE" on the vacuumed table in question? Is it > the simple fact that we don't know what table this particular section of > WAL affects, or is it the complexity of tracking all this info? The problem is that even if transaction X doesn't have an (access share) lock on the vacuumed table at the moment, it might take one in the future. Simon proposed mechanisms for storing the information about vacuumed tables in shared memory, so that if X takes the lock later on it will get canceled at that point, but that's 9.1 material. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 26/02/10 14:45, Heikki Linnakangas wrote: > Richard Huxton wrote: >> On 26/02/10 08:33, Greg Smith wrote: >>> I'm not sure what you might be expecting from the above combination, but >>> what actually happens is that many of the SELECT statements on the table >>> *that isn't even being updated* are canceled. You see this in the logs: >> >> Hmm - this I'd already figured out for myself. It's just occurred to me >> that this could well be the case between databases too. Database A gets >> vacuumed, B gets its queries kicked off on the standby. > > No, it's per-database already. Only queries in the same database are > canceled. That's a relief. >> Dumb non-hacker question: why do we cancel all transactions rather than >> just those with "ACCESS SHARE" on the vacuumed table in question? Is it >> the simple fact that we don't know what table this particular section of >> WAL affects, or is it the complexity of tracking all this info? > > The problem is that even if transaction X doesn't have an (access share) > lock on the vacuumed table at the moment, it might take one in the > future. Simon proposed mechanisms for storing the information about > vacuumed tables in shared memory, so that if X takes the lock later on > it will get canceled at that point, but that's 9.1 material. I see - we'd need to age the list of vacuumed tables too, so when the oldest transactions complete the correct flags get cleared. Can we not wait to cancel the transaction until *any* new lock is attempted though? That should protect all the single-statement long-running transactions that are already underway. Aggregates etc. -- Richard Huxton Archonet Ltd
Re: Hot Standby query cancellation and Streaming Replication integration
From: Heikki Linnakangas
Richard Huxton wrote: > Can we not wait to cancel the transaction until *any* new lock is > attempted though? That should protect all the single-statement > long-running transactions that are already underway. Aggregates etc. Hmm, that's an interesting thought. You'll still need to somehow tell the victim backend "you have to fail if you try to acquire any more locks", but a single per-backend flag in the procarray would suffice. You could also clear the flag whenever you free the last snapshot in the transaction (ie. between each query in read committed mode). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
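To make the proposed rule concrete, a hedged illustration of the two cases, assuming the conflicting cleanup record is replayed while the first statement is still running; the table names reuse the pgbench demo from earlier in the thread:

  -- Case 1: a single long-running statement. Under the proposed rule it keeps
  -- running, because it never tries to acquire another lock after the backend
  -- has been flagged.
  SELECT sum(abalance) FROM pgbench_accounts;

  -- Case 2: a snapshot-holding transaction that later touches another table.
  -- The second statement needs a new ACCESS SHARE lock, so this is the point
  -- at which it would be cancelled.
  BEGIN ISOLATION LEVEL SERIALIZABLE;
  SELECT sum(abalance) FROM pgbench_accounts;
  SELECT sum(tbalance) FROM pgbench_tellers;   -- new lock acquisition: fails here
  COMMIT;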
Replying to my own post - first sign of madness...

Let's see if I've got the concepts clear here, and hopefully my thinking it through will help others reading the archives.

There are two queues:
1. Cleanup on the master
2. Replay on the slave

Running write queries on the master adds to both queues. Running (read-only) queries on the slave prevents you from removing from both queues.

There are two interesting measurements of "age"/"size":
1. Oldest item in / length of queue (knowable)
2. How long will it take to clear the queue (estimable at best)

You'd like to know #2 to keep up with your workload. Unfortunately, you can't for certain unless you have control over new incoming queries (on both master and slave). You might want four separate GUCs for the two measurements on the two queues. We currently have two that (sort of) match #1 "Oldest item" (vacuum_defer_cleanup_age, max_standby_delay).

Delaying replay on a slave has no effect on the master. If a slave falls too far behind it's responsible for catch-up (via normal WAL archives).

There is no point in delaying cleanup on the master unless it's going to help one or more slaves. In fact, you don't want to start delaying cleanup until you have to, otherwise you're wasting your delay time. This seems to be the case with vacuum_defer_cleanup_age. If I have a heavily-updated table and I defer vacuuming, then when any given query starts on the slave it's going to be half used up already.

There's also no point in deferring cleanup on the master if the standby is already waiting on a conflict that will cause its queries to be cancelled anyway. Not only won't it help, but it might make things worse since transactions will be cancelled, the conflict will be replayed and (presumably) queries will be re-submitted only to be cancelled again.

This is what Greg Smith's discussion of the keep-alives was about: giving the master enough information to be smarter about cleanup (and making the conflicts more fine-grained).

The situation with deferring on one or both ends of the process just gets more complicated with multiple slaves. There are all sorts of unpleasant feedback loops I can envisage there.

For the case of a single slave being used to run long reporting queries, the ideal scenario would be the following. The master starts deferring vacuum activity just before the query starts. When that times out, the slave will receive the cleanup info, refuse to replay it and start its delay. This gives you a total available query time of:

natural time between vacuums + vacuum delay + WAL transfer time + standby delay

I can think of five useful things we should be doing (and might be already - don't know).

1. On the master, deduce whether the slave is already waiting on a query. If so, don't bother delaying cleanup. Clearly you don't want to be signalling hundreds of times a second though. Does the slave pause fetching via streaming replication if replay is blocked on a query? Could we signal "half-way to max-age" or some such?

2. Perhaps simpler than trying to make the master smarter, just allow SET this_transaction_is_probably_a_long_one=true on the slave. That (a) clears the queue on the slave and (b) sends the signal to the master, which then starts deferring vacuum.

3. Do a burst of cleanup activity on the master after blocking. This should concentrate conflicts together when they reach the slave. Perhaps vacuum_defer_cleanup_age should be vacuum_deferred_queue_size and measure the amount of work to do, rather than the max age of the oldest cleanup (if I've understood correctly).

4. Do a burst of replay on the slave after blocking. Perhaps every time it cancels a transaction it should replay at least half the queued WAL before letting new transactions start. Or perhaps it replays any vacuum activity it comes across and then stops. That should sync with #2 assuming the slave doesn't lag the master too much.

5. I've been mixing "defer" and "delay", as do the docs. We should probably settle on one or the other. I think defer conveys the meaning more precisely, but what about non-native English speakers?

--
Richard Huxton
Archonet Ltd
On Fri, Feb 26, 2010 at 10:21 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Richard Huxton wrote: >> Can we not wait to cancel the transaction until *any* new lock is >> attempted though? That should protect all the single-statement >> long-running transactions that are already underway. Aggregates etc. > > Hmm, that's an interesting thought. You'll still need to somehow tell > the victim backend "you have to fail if you try to acquire any more > locks", but a single per-backend flag in the procarray would suffice. > > You could also clear the flag whenever you free the last snapshot in the > transaction (ie. between each query in read committed mode). Wow, that seems like it would help a lot. Although I'm not 100% sure I follow all the details of how this works. ...Robert
On Fri, Feb 26, 2010 at 4:43 PM, Richard Huxton <dev@archonet.com> wrote: > Let's see if I've got the concepts clear here, and hopefully my thinking it > through will help others reading the archives. > > There are two queues: I don't see two queues. I only see the one queue of operations which have been executed on the master but not replayed yet on the slave. Every write operation on the master enqueues an operation to it and every operation replayed on the slave dequeues from it. Only a subset of operations create conflicts with concurrent transactions on the slave, namely vacuums and a few similar operations (HOT pruning and btree index pruning). There's no question we need to make sure users have good tools to monitor this queue and are aware of these tools. You can query each slave for its currently replayed log position and hopefully you can find out how long it's been delayed (ie, if it's looking at a log record and waiting for a conflict to clear how long ago that log record was generated). You can also find out what the log position is on the master. -- greg
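As a hedged sketch of that monitoring, assuming the location-reporting functions that ship with the streaming replication code of this era:

  -- On the master: the current WAL position being generated
  SELECT pg_current_xlog_location();

  -- On the standby: how much WAL has been received and how much has been replayed;
  -- the distance between those numbers (and to the master's position) is the queue above.
  SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();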
Heikki Linnakangas wrote:
> > How to handle situations where the standby goes away for a while,
> > such as a network outage, so that it doesn't block the master from ever
> > cleaning up dead tuples is a concern.
>
> Yeah, that's another issue that needs to be dealt with. You'd probably
> need some kind of a configurable escape valve in the master, to let it
> ignore a standby's snapshot once it gets too old.
>
> > But I do know that the current Hot Standby implementation is going to be
> > frustrating to configure correctly for people.
>
> Perhaps others who are not as deep into the code as I am will have a
> better view on this, but I seriously don't think that's such a big
> issue. I think the max_standby_delay setting is quite intuitive and easy
> to explain. Sure, it would be better if there was no tradeoff between
> killing queries and stalling recovery, but I don't think it'll be that
> hard to understand the tradeoff.

Let's look at the five documented cases of query conflict (from our manual):

1. Access Exclusive Locks from the primary node, including both explicit LOCK commands and various DDL actions
2. Dropping tablespaces on the primary while standby queries are using those tablespaces for temporary work files (work_mem overflow)
3. Dropping databases on the primary while users are connected to that database on the standby.
4. The standby waiting longer than max_standby_delay to acquire a buffer cleanup lock.
5. Early cleanup of data still visible to the current query's snapshot

We might have a solution to #1 by only cancelling queries that try to take locks. #2 and #3 seem like rare occurrences. #4 can be controlled by max_standby_delay, where a large value only delays playback during crash recovery --- again, a rare occurrence. #5 could be handled by using vacuum_defer_cleanup_age on the master. Why is vacuum_defer_cleanup_age not listed in postgresql.conf?

In summary, I think passing snapshots to the master is not something possible for 9.0, and ideally we will never need to add that feature.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
+ If your life is a hard drive, Christ can be your backup. +
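For concreteness, a hedged sketch of the master-side setting Bruce points to for case #5; the value is an illustration only, and the unit is transaction IDs rather than time, which is exactly the usability complaint raised later in the thread:

  # postgresql.conf on the master
  vacuum_defer_cleanup_age = 10000   # hold cleanup back by ~10000 transactions (default is 0)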
Greg Stark <gsstark@mit.edu> writes: > In the model you describe any long-lived queries on the slave cause > tables in the master to bloat with dead records. Yup, same as they would do on the master. > I think this model is on the roadmap but it's not appropriate for > everyone and I think one of the benefits of having delayed it is that > it forces us to get the independent model right before throwing in > extra complications. It would be too easy to rely on the slave > feedback as an answer for hard questions about usability if we had it > and just ignore the question of what to do when it's not the right > solution for the user. I'm going to make an unvarnished assertion here. I believe that the notion of synchronizing the WAL stream against slave queries is fundamentally wrong and we will never be able to make it work. The information needed isn't available in the log stream and can't be made available without very large additions (and consequent performance penalties). As we start getting actual beta testing we are going to uncover all sorts of missed cases that are not going to be fixable without piling additional ugly kluges on top of the ones Simon has already crammed into the system. Performance and reliability will both suffer. I think that what we are going to have to do before we can ship 9.0 is rip all of that stuff out and replace it with the sort of closed-loop synchronization Greg Smith is pushing. It will probably be several months before everyone is forced to accept that, which is why 9.0 is not going to ship this year. regards, tom lane
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From: Bruce Momjian
Tom Lane wrote:
> I'm going to make an unvarnished assertion here. I believe that the
> notion of synchronizing the WAL stream against slave queries is
> fundamentally wrong and we will never be able to make it work.
> The information needed isn't available in the log stream and can't be
> made available without very large additions (and consequent performance
> penalties). As we start getting actual beta testing we are going to
> uncover all sorts of missed cases that are not going to be fixable
> without piling additional ugly kluges on top of the ones Simon has
> already crammed into the system. Performance and reliability will both
> suffer.
>
> I think that what we are going to have to do before we can ship 9.0
> is rip all of that stuff out and replace it with the sort of closed-loop
> synchronization Greg Smith is pushing. It will probably be several
> months before everyone is forced to accept that, which is why 9.0 is
> not going to ship this year.

Wow, can I have some varnish with that. :-O

You are right that we need to go down the road a bit before we know what we need for 9.0 or 9.1.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
+ If your life is a hard drive, Christ can be your backup. +
On 2/26/10 6:57 AM, Richard Huxton wrote: > > Can we not wait to cancel the transaction until *any* new lock is > attempted though? That should protect all the single-statement > long-running transactions that are already underway. Aggregates etc. I like this approach. Is it fragile in some non-obvious way? --Josh Berkus
On 2/26/10 10:53 AM, Tom Lane wrote: > I think that what we are going to have to do before we can ship 9.0 > is rip all of that stuff out and replace it with the sort of closed-loop > synchronization Greg Smith is pushing. It will probably be several > months before everyone is forced to accept that, which is why 9.0 is > not going to ship this year. I don't think that publishing visibility info back to the master ... and subsequently burdening the master substantially for each additional slave ... are what most users want. Certainly for use cases like NTT's, it is, but not for most of our users. In fact, I seem to remember specifically discussing the approach of trying to publish snapshots back to the master, and rejecting it on this list during the development of SR. Does anyone know how Oracle solves these issues? Does their structure (separate rollback log) make it easier for them? --Josh Berkus
Josh Berkus <josh@agliodbs.com> writes: > On 2/26/10 10:53 AM, Tom Lane wrote: >> I think that what we are going to have to do before we can ship 9.0 >> is rip all of that stuff out and replace it with the sort of closed-loop >> synchronization Greg Smith is pushing. It will probably be several >> months before everyone is forced to accept that, which is why 9.0 is >> not going to ship this year. > I don't think that publishing visibility info back to the master ... and > subsequently burdening the master substantially for each additional > slave ... are what most users want. I don't see a "substantial additional burden" there. What I would imagine is needed is that the slave transmits a single number back --- its current oldest xmin --- and the walsender process publishes that number as its transaction xmin in its PGPROC entry on the master. I don't doubt that this approach will have its own gotchas that we find as we get into it. But it looks soluble. I have no faith in either the correctness or the usability of the approach currently being pursued. regards, tom lane
> I don't see a "substantial additional burden" there. What I would > imagine is needed is that the slave transmits a single number back > --- its current oldest xmin --- and the walsender process publishes > that number as its transaction xmin in its PGPROC entry on the master. If the main purpose of the slave is long-running queries, though, this could cause a lot of bloat on the master. That's a special case, but a reason why we would want to preserve the stop replication functionality. > I don't doubt that this approach will have its own gotchas that we > find as we get into it. But it looks soluble. I have no faith in > either the correctness or the usability of the approach currently > being pursued. So, why not start working on it now, instead of arguing about it? It'll be easy to prove the approach once we have some test code.
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From: Heikki Linnakangas
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> On 2/26/10 10:53 AM, Tom Lane wrote:
>>> I think that what we are going to have to do before we can ship 9.0
>>> is rip all of that stuff out and replace it with the sort of closed-loop
>>> synchronization Greg Smith is pushing. It will probably be several
>>> months before everyone is forced to accept that, which is why 9.0 is
>>> not going to ship this year.
>
>> I don't think that publishing visibility info back to the master ... and
>> subsequently burdening the master substantially for each additional
>> slave ... are what most users want.
>
> I don't see a "substantial additional burden" there. What I would
> imagine is needed is that the slave transmits a single number back
> --- its current oldest xmin --- and the walsender process publishes
> that number as its transaction xmin in its PGPROC entry on the master.

The additional burden comes from the old snapshot effect. It makes it unusable for offloading reporting queries, for example. In general, it is a very good architectural property that the master is not affected by what happens in a standby, and a closed-loop synchronization would break that.

I don't actually understand how tight synchronization on its own would solve the problem. What if the connection to the master is lost? Do you kill all queries in the standby before reconnecting?

One way to think about this is to first consider a simple stop-and-go system. Clearly the database must be consistent at any point in the WAL sequence, if recovery was stopped and the database started up. So it is always safe to pause recovery and run a read-only query against the database as it is at that point in time (this assumes that the index "cleanup" operations are not required for consistent query results, BTW). After the read-only transaction is finished, you can continue recovery.

The next step up is to relax that so that you allow replay of those WAL records that are known to not cause trouble to the read-only queries. For example, heap_insert records are very innocent, they only add rows with a yet-uncommitted xmin. Things get more complex when you allow the replay of commit records; all the known-assigned-xids tracking is related to that, so that transactions that are not committed when a snapshot is taken in the standby are considered uncommitted by the snapshot even after the commit record is later replayed. If that feels too fragile, there might be other methods to achieve that. One I once pondered is to not track all in-progress transactions in shared memory like we do now, but only OldestXmin. When a backend wants to take a snapshot in the slave, it memcpy()s clog from OldestXmin to the latest committed XID, and includes it in the snapshot. The visibility checks use the copy instead of the actual clog, so they see the situation as it was when the snapshot was taken. To keep track of the OldestXmin in the slave, the master can emit that as a WAL record every now and then; it's OK if it lags behind.

Then there are the WAL record types that remove data that might still be required by the read-only transactions. This includes vacuum and index deletion records.

If you really think the current approach is unworkable, I'd suggest that we fall back to a stop-and-go system, where you either let the recovery progress or allow queries to run, but not both at the same time. But FWIW I don't think the situation is that grave.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Feb 26, 2010 at 7:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I don't see a "substantial additional burden" there. What I would > imagine is needed is that the slave transmits a single number back > --- its current oldest xmin --- and the walsender process publishes > that number as its transaction xmin in its PGPROC entry on the master. And when we want to support cascading slaves? Or when you want to bring up a new slave and it suddenly starts advertising a new xmin that's older than the current oldestxmin? But in any case if I were running a reporting database I would want it to just stop replaying logs for a few hours while my big batch report runs, not cause the master to be unable to vacuum any dead records for hours. That defeats much of the purpose of running the queries on the slave. -- greg
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > I don't actually understand how tight synchronization on its own would > solve the problem. What if the connection to the master is lost? Do you > kill all queries in the standby before reconnecting? Sure. So what? They'd have been killed if they individually lost connections to the master (or the slave), too. > [ assorted analysis based on WAL contents ] The problem is all the interactions that are not reflected (historically anyway) to WAL. We already know about btree page reclamation interlocks and relcache init files. How many others are there, and how messy and expensive is it going to be to deal with them? > If you really think the current approach is unworkable, I'd suggest that > we fall back to a stop-and-go system, where you either let the recovery > to progress or allow queries to run, but not both at the same time. But > FWIW I don't think the situation is that grave. I might be wrong. I hope for the sake of the project schedule that I am wrong. But I'm afraid that we will spend several months beavering away to try to make the current approach solid and user-friendly, and eventually conclude that it's a dead end. It would be prudent to have a Plan B; and it looks to me like closed-loop synchronization is the best Plan B. Putting off all thought about it for the next release cycle seems like a recipe for a scheduling disaster. regards, tom lane
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From: "Joshua D. Drake"
On Fri, 2010-02-26 at 12:02 -0800, Josh Berkus wrote: > > I don't see a "substantial additional burden" there. What I would > > imagine is needed is that the slave transmits a single number back > > --- its current oldest xmin --- and the walsender process publishes > > that number as its transaction xmin in its PGPROC entry on the master. > > If the main purpose of the slave is long-running queries, though, this > could cause a lot of bloat on the master. That's a special case, but a > reason why we would want to preserve the stop replication functionality. > Do we really think that users, using the slave to run long-running queries is a special case? One of the number one things I can see this being used for is reporting.... Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From: Aidan Van Dyk
* Greg Stark <gsstark@mit.edu> [100226 15:10]:
> On Fri, Feb 26, 2010 at 7:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I don't see a "substantial additional burden" there. What I would
> > imagine is needed is that the slave transmits a single number back
> > --- its current oldest xmin --- and the walsender process publishes
> > that number as its transaction xmin in its PGPROC entry on the master.
>
> And when we want to support cascading slaves?
>
> Or when you want to bring up a new slave and it suddenly starts
> advertising a new xmin that's older than the current oldestxmin?
>
> But in any case if I were running a reporting database I would want it
> to just stop replaying logs for a few hours while my big batch report
> runs, not cause the master to be unable to vacuum any dead records for
> hours. That defeats much of the purpose of running the queries on the
> slave.

*I* would be quite happy having the stop-and-go and the closed-loop be the only 2 modes of operation, and I'd even be quite happy if they were both limited to separate methods:

1) Running SR - then you are forced to use a closed-loop
2) Running HS from a backup/archive - forced to use stop-n-go

#1 still needs to deal with a slave "disappearing" and not advancing xmin for a period (TCP timeout)?

I'll note that until SR does synchronous streaming rep (which will likely require some closed-loop plan to allow the slave to be hot), I want situation #2, and hopefully the knob to control how long it allows a "stop" before going again can be a HUP'able knob so I can change it occasionally without taking the server down...

--
Aidan Van Dyk    aidan@highrise.ca    http://www.highrise.ca/
Create like a god, command like a king, work like a slave.
Greg Stark <gsstark@mit.edu> writes: > On Fri, Feb 26, 2010 at 7:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I don't see a "substantial additional burden" there. �What I would >> imagine is needed is that the slave transmits a single number back >> --- its current oldest xmin --- and the walsender process publishes >> that number as its transaction xmin in its PGPROC entry on the master. > And when we want to support cascading slaves? So? Fits right in. The walsender on the first-level slave is advertising an xmin from the second-level one, which will be included in what's passed back up to the master. > Or when you want to bring up a new slave and it suddenly starts > advertising a new xmin that's older than the current oldestxmin? How's it going to do that, when it has no queries at the instant of startup? > But in any case if I were running a reporting database I would want it > to just stop replaying logs for a few hours while my big batch report > runs, not cause the master to be unable to vacuum any dead records for > hours. That defeats much of the purpose of running the queries on the > slave. Well, as Heikki said, a stop-and-go WAL management approach could deal with that use-case. What I'm concerned about here is the complexity, reliability, maintainability of trying to interlock WAL application with slave queries in any sort of fine-grained fashion. regards, tom lane
> Well, as Heikki said, a stop-and-go WAL management approach could deal > with that use-case. What I'm concerned about here is the complexity, > reliability, maintainability of trying to interlock WAL application with > slave queries in any sort of fine-grained fashion. This sounds a bit brute-force, but what about simply having some form of automatic query retry on the slave? --Josh Berkus
On Fri, Feb 26, 2010 at 8:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > How's it going to do that, when it has no queries at the instant > of startup? > Why shouldn't it have any queries at walreceiver startup? It has any xlog segments that were copied from the master and any it can find in the archive, it could easily reach a consistent point long before it needs to connect to the master. If you really want to protect your master from any additional overhead you don't currently need to configure a streaming connection at all, you can just use the file shipping interface. -- greg
Greg Stark <gsstark@mit.edu> writes: > Why shouldn't it have any queries at walreceiver startup? It has any > xlog segments that were copied from the master and any it can find in > the archive, it could easily reach a consistent point long before it > needs to connect to the master. If you really want to protect your > master from any additional overhead you don't currently need to > configure a streaming connection at all, you can just use the file > shipping interface. There's *definitely* not going to be enough information in the WAL stream coming from a master that doesn't think it has HS slaves. We can't afford to record all that extra stuff in installations for which it's just useless overhead. BTW, has anyone made any attempt to measure the performance hit that the patch in its current form is creating via added WAL entries? regards, tom lane
On Fri, Feb 26, 2010 at 9:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > There's *definitely* not going to be enough information in the WAL > stream coming from a master that doesn't think it has HS slaves. > We can't afford to record all that extra stuff in installations for > which it's just useless overhead. BTW, has anyone made any attempt > to measure the performance hit that the patch in its current form is > creating via added WAL entries? What extra entries? -- greg
Re: Hot Standby query cancellation and Streaming Replication integration
From: Dimitri Fontaine
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Well, as Heikki said, a stop-and-go WAL management approach could deal
> with that use-case. What I'm concerned about here is the complexity,
> reliability, maintainability of trying to interlock WAL application with
> slave queries in any sort of fine-grained fashion.

Some admin functions for Hot Standby were removed from the patch to ease its integration; there was a pause() and resume() feature.

I think that offering this explicit control to the user would allow them to choose between an HA setup and a reporting setup easily enough: just pause the replay when running the reporting, resume it to get fresh data again. If you don't pause, any query can get killed, replay is the priority.

Now as far as the feedback loop is concerned, I guess the pause() function would cause the slave to stop publishing any xmin in the master's procarray so that it's free to vacuum and archive whatever it wants to. Should the slave accumulate too much lag, it will resume from the archive rather than live from the SR link.

How much would that help?

Regards,
--
dim
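A hedged sketch of what that would look like from a reporting session; the function names below are hypothetical, since the pause()/resume() admin calls were removed from the patch and are not part of 9.0:

  -- hypothetical names, for illustration only
  SELECT pg_recovery_pause();     -- stop applying WAL on the standby
  -- ... run the long reporting queries against the now-frozen data ...
  SELECT pg_recovery_resume();    -- let replay catch up again afterwards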
Greg Stark <gsstark@mit.edu> writes: > On Fri, Feb 26, 2010 at 9:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There's *definitely* not going to be enough information in the WAL >> stream coming from a master that doesn't think it has HS slaves. >> We can't afford to record all that extra stuff in installations for >> which it's just useless overhead. �BTW, has anyone made any attempt >> to measure the performance hit that the patch in its current form is >> creating via added WAL entries? > What extra entries? Locks, just for starters. I haven't read enough of the code yet to know what else Simon added. In the past it's not been necessary to record any transient information in WAL, but now we'll have to. regards, tom lane
Dimitri Fontaine wrote:
> Some admin functions for Hot Standby were removed from the patch to ease
> its integration; there was a pause() and resume() feature.
>
> I think that offering this explicit control to the user would allow them
> to choose between an HA setup and a reporting setup easily enough: just
> pause the replay when running the reporting, resume it to get fresh data
> again. If you don't pause, any query can get killed, replay is the
> priority.

Doesn't the system already adjust the delay based on the length of slave transactions, e.g. max_standby_delay? It seems there is no need for a user switch --- just set max_standby_delay really high.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
+ If your life is a hard drive, Christ can be your backup. +
Re: Hot Standby query cancellation and Streaming Replication integration
From: Dimitri Fontaine
Bruce Momjian <bruce@momjian.us> writes:
> Doesn't the system already adjust the delay based on the length of slave
> transactions, e.g. max_standby_delay? It seems there is no need for a
> user switch --- just set max_standby_delay really high.

Well, that GUC looks like it allows setting a compromise between HA and reporting, not saying "do not ever give the priority to the replay while I'm running my reports". At least that's how I understand it.

The feedback loop might get expensive on the master server when running reporting queries on the slave, unless you can "pause" it explicitly, I think. I don't see how the system will guess that you're running a reporting server rather than an HA node, and max_standby_delay is just a way to tell the standby to please be nice in case of abuse.

Regards,
--
dim
Dimitri Fontaine wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Doesn't the system already adjust the delay based on the length of slave
> > transactions, e.g. max_standby_delay? It seems there is no need for a
> > user switch --- just set max_standby_delay really high.
>
> Well, that GUC looks like it allows setting a compromise between HA and
> reporting, not saying "do not ever give the priority to the replay while
> I'm running my reports". At least that's how I understand it.

Well, if you set it high, it effectively is that.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
+ If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > I don't see a "substantial additional burden" there. What I would > imagine is needed is that the slave transmits a single number back > --- its current oldest xmin --- and the walsender process publishes > that number as its transaction xmin in its PGPROC entry on the master. > That is exactly the core idea I was trying to suggest in my rambling message. Just that small additional bit of information transmitted and published to the master via that route, and it's possible to optimize this problem in a way not available now. And it's a way that I believe will feel more natural to some users who may not be well served by any of the existing tuning possibilities. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Bruce Momjian wrote:
> Doesn't the system already adjust the delay based on the length of slave
> transactions, e.g. max_standby_delay? It seems there is no need for a
> user switch --- just set max_standby_delay really high.

The first issue is that you're basically saying "I don't care about high availability anymore" when you increase max_standby_delay to a high value. Want to offload an 8 hour long batch report every day to the standby? You can do it with max_standby_delay=8 hours. But the day your master crashes 7 hours into that, you're in for a long wait before your standby is available while it replays all the queued up segments. Your 'hot standby' has actually turned into the old form of 'cold standby' just when you need it to be responsive.

This is also the reason why the whole "pause recovery" idea is a fruitless path to wander down. The whole point of this feature is that people have a secondary server available for high-availability, *first and foremost*, but they'd like it to do something more interesting than leave it idle all the time. The idea that you can hold off on applying standby updates for long enough to run seriously long reports is completely at odds with the idea of high-availability.

The second major problem is that the day the report actually takes 8.1 hours instead, because somebody else ran another report that slowed you down a little, you're screwed if that's something you depend on being available--it just got canceled only *after* wasting 8 hours of reporting resource time.

max_standby_delay is IMHO only useful for allowing non-real-time web-app style uses of HS (think "Facebook status updates"), where you say "I'm OK giving people slightly out of date info sometimes if it lets me split the query load over two systems". Set max_standby_delay to a few seconds or maybe a minute, enough time to service a typical user query, make your app tolerate the occasional failed query and only send big ones to the master, and you've just scaled up all the small ones. Distributed queries with "eventual consistency" on all nodes is where many of the web app designs are going, and this feature is a reasonable match for that use case.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Bruce Momjian wrote:
> 5. Early cleanup of data still visible to the current query's
> snapshot
>
> #5 could be handled by using vacuum_defer_cleanup_age on the master.
> Why is vacuum_defer_cleanup_age not listed in postgresql.conf?

I noticed that myself and fired off a corrective patch to Simon yesterday; he said it was intentional, but I'm not sure why that is yet. We'll sort that out.

You are correct that my suggestion is targeting primarily #5 on this list. There are two problems with the possible solutions using that parameter though:

- vacuum_defer_cleanup_age is set in a unit that people cannot be expected to work in--transaction IDs. The UI is essentially useless, and there's no obvious way to make a better one. The best you can do will still be really fragile.

- If you increase vacuum_defer_cleanup_age, it's active all the time. You're basically making every single transaction that could be cleaned up pay for the fact that a query *might* be running on the standby that it needs to avoid.

You can think of the idea of passing an xmin back from the standby as being like an auto-tuning vacuum_defer_cleanup_age. It's 0 when no standby queries are running, but grows in size to match longer ones. And you don't have to know anything to set it correctly; just toggle on the proposed "feedback xid from the standby" feature and you're safe.

Expecting that anyone will ever set vacuum_defer_cleanup_age correctly in the field in its current form is pretty unreasonable I think. Since there's no timestamp-based memory of past xid activity, it's difficult to convert it to that form instead, and I think something in terms of time is what people would like to set this in.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
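One rough way to bridge that unit mismatch today, as a hedged sketch: sample the transaction counter on the master twice to estimate how many xids are consumed per minute, then size vacuum_defer_cleanup_age from the time window you want to protect. The numbers below are made up for illustration.

  -- On the master, sample the transaction counter twice:
  SELECT txid_current();   -- note the value, e.g. 1234000
  SELECT pg_sleep(60);
  SELECT txid_current();   -- note it again, e.g. 1237000

  -- xids per minute = the difference between the samples (3000 in this made-up case).
  -- To cover roughly five minutes of standby queries you would then set
  -- vacuum_defer_cleanup_age to about 5 * 3000 = 15000.

It is fragile for exactly the reason given above: the xid consumption rate changes with the workload, so the amount of time such a setting actually buys you drifts.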
> That is exactly the core idea I was trying to suggest in my rambling > message. Just that small additional bit of information transmitted and > published to the master via that route, and it's possible to optimize > this problem in a way not available now. And it's a way that I believe > will feel more natural to some users who may not be well served by any > of the existing tuning possibilities. Well, if both you and Tom think it would be relatively easy (or at least easier that continuing to pursue query cancel troubleshooting), then please start coding it. It was always a possible approach, we just collectively thought that query cancel would be easier. --Josh Berkus
Greg Smith wrote: > Bruce Momjian wrote: > > Doesn't the system already adjust the delay based on the length of slave > > transactions, e.g. max_standby_delay. It seems there is no need for a > > user switch --- just max_standby_delay really high. > > > > The first issue is that you're basically saying "I don't care about high > availability anymore" when you increase max_standby_delay to a high > value. Want to offload an 8 hour long batch report every day to the > standby? You can do it with max_standby_delay=8 hours. But the day > your master crashes 7 hours into that, you're in for a long wait before > your standby is available while it replays all the queued up segments. > Your 'hot standby' has actually turned into the old form of 'cold > standby' just when you need it to be responsive. Well, I think the choice is either you delay vacuum on the master for 8 hours or pile up 8 hours of WAL files on the slave, and delay application, and make recovery much slower. It is not clear to me which option a user would prefer because the bloat on the master might be permanent. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. +
On Fri, Feb 26, 2010 at 11:56 PM, Greg Smith <greg@2ndquadrant.com> wrote: > This is also the reason why the whole "pause recovery" idea is a fruitless > path to wander down. The whole point of this feature is that people have a > secondary server available for high-availability, *first and foremost*, but > they'd like it to do something more interesting than leave it idle all the > time. The idea that you can hold off on applying standby updates for long > enough to run seriously long reports is completely at odds with the idea of > high-availability. Well you can go sit in the same corner as Simon with your high availability servers. I want my ability to run large batch queries without any performance or reliability impact on the primary server. You can have one or the other but you can't get both. If you set max_standby_delay low then you get your high availability server, if you set it high you get a useful report server. If you build sync replication which we don't have today and which will open another huge can of usability worms when we haven't even finished bottling the two we've already opened then you lose the lack of impact on the primary. Suddenly the queries you run on the slaves cause your production database to bloat. Plus you have extra network connections which take resources on your master and have to be kept up at all times or you lose your slaves. I think the design constraint of not allowing any upstream data flow is actually very valuable. Eventually we'll have it for sync replication but it's much better that we've built things incrementally and can be sure that nothing really depends on it for basic functionality. This is what allows us to know that the slave imposes no reliability impact on the master. It's what allows us to know that everything will work identically regardless of whether you have a walreceiver running or are running off archived log files. Remember I wanted to entirely abstract away the walreceiver and allow multiple wal communication methods. I think it would make more sense to use something like Spread to distribute the logs so the master only has to send them once and as many slaves as you want can pick them up. The current architecture doesn't scale very well if you want to have hundreds of slaves for one master. -- greg
Greg Smith wrote: > You can think of the idea of passing an xmin back from the standby as > being like an auto-tuning vacuum_defer_cleanup_age. It's 0 when no > standby queries are running, but grows in size to match longer ones. And > you don't have to know anything to set it correctly; just toggle > on the proposed "feedback xid from the standby" feature and you're safe. Yes, there is no question there is value in passing the xid back to the master. My point is that it is like your blog entry: http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html you can't have all the options: 1. aggressive vacuum on master, 2. fast recovery of slave, 3. no query cancel on slave. Again, pick two. Passing the xid back to the master gives you #2 and #3, and that might be good for some people, but not others. Do we have any idea which options most administrators would want? If we get xid passback to the master, do we keep the other configuration options? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. +
Greg Stark wrote: > Well you can go sit in the same corner as Simon with your high > availability servers. > > I want my ability to run large batch queries without any performance > or reliability impact on the primary server. > Thank you for combining a small personal attack with a selfish commentary about how yours is the only valid viewpoint. Saves me a lot of trouble replying to your messages, can just ignore them instead if this is how you're going to act. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Sat, Feb 27, 2010 at 1:53 AM, Greg Smith <greg@2ndquadrant.com> wrote: > Greg Stark wrote: >> >> Well you can go sit in the same corner as Simon with your high >> availability servers. >> >> I want my ability to run large batch queries without any performance >> or reliability impact on the primary server. >> > > Thank you for combining a small personal attack with a selfish commentary > about how yours is the only valid viewpoint. Saves me a lot of trouble > replying to your messages, can just ignore them instead if this is how > you're going to act. Eh? That's not what I meant at all. Actually it's kind of the exact opposite of what I meant. What I meant was that your description of the "High Availability first and foremost" is only one possible use case. Simon in the past expressed the same single-minded focus on that use case. It's a perfectly valid use case and I would probably agree that if we had to choose just one it would be the most important. But we don't have to choose just one. There are other valid use cases such as load balancing and isolating your large batch queries from your production systems. I don't want us to throw out all these other use cases because we treat high availability as the only use case we're interested in. -- greg
Bruce Momjian wrote: > Well, I think the choice is either you delay vacuum on the master for 8 > hours or pile up 8 hours of WAL files on the slave, and delay > application, and make recovery much slower. It is not clear to me which > option a user would prefer because the bloat on the master might be > permanent. > But if you're running the 8 hour report on the master right now, aren't you already exposed to a similar pile of bloat issues while it's going? If I have the choice between "sometimes queries will get canceled" vs. "sometimes the master will experience the same long-running transaction bloat issues as in earlier versions even if the query runs on the standby", I feel like leaning toward the latter at least leads to a problem people are used to. This falls into the principle of least astonishment category to me. Testing the final design for how transactions get canceled here led me to some really unexpected situations, and the downside for a mistake is "your query is lost". Had I instead discovered that sometimes long-running transactions on the standby can ripple back to cause a maintenance slowdown on the master, that's not great. But it would not have been so surprising, and it won't result in lost query results. I think people will expect that their queries cancel because of things like DDL changes. And the existing knobs allow inserting some slack for things like locks taking a little bit of time to acquire sometimes. What I don't think people will see coming is that a routine update on an unrelated table is going to kill a query they might have been waiting hours for the result of, just because that update crossed an autovacuum threshold for the other table and introduced a dead row cleanup. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Heikki Linnakangas wrote: > One such landmine is that the keepalives need to flow from client to > server while the WAL records are flowing from server to client. We'll > have to crack that problem for synchronous replication too, but I think > that alone is a big enough problem to make this 9.1 material. > This seems to be the real sticking point then, given that the xmin/PGPROC side on the master seems logically straightforward. For some reason I thought the sync rep feature had the reverse message flow already going, and that some other sort of limitation just made it impractical to merge into the main codebase this early. My hope was that just this particular part could get cherry-picked out of there, and that it might even have been thought about already in that context given the known HS keepalive "serious issue". If there was a solution or partial solution in progress to that floating around, my thought was that just piggybacking this extra xid info on top of it would be easy enough. If there's not already a standby to primary communications backchannel implementation available that can be harvested from that work, your suggestion that this may not be feasible at all for 9.0 seems like a more serious concern than I had thought it was going to be. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Smith wrote: > Heikki Linnakangas wrote: > > One such landmine is that the keepalives need to flow from client to > > server while the WAL records are flowing from server to client. We'll > > have to crack that problem for synchronous replication too, but I think > > that alone is a big enough problem to make this 9.1 material. > > > > This seems to be the real sticking point then, given that the > xmin/PGPROC side on the master seems logically straightforward. For > some reason I thought the sync rep feature had the reverse message flow > already going, and that some other sort of limitation just made it > impractical to merge into the main codebase this early. My hope was > that just this particular part could get cherry-picked out of there, and > that it might even have been thought about already in that context given > the known HS keepalive "serious issue". If there was a solution or > partial solution in progress to that floating around, my thought was > that just piggybacking this extra xid info on top of it would be easy > enough. > > If there's not already a standby to primary communications backchannel > implementation available that can be harvested from that work, your > suggestion that this may not be feasible at all for 9.0 seems like a > more serious concern than I had thought it was going to be. I suspect the master could connect to the slave to pull an xid. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. +
Greg Stark wrote: > Eh? That's not what I meant at all. Actually it's kind of the exact > opposite of what I meant. > Sorry about that--I think we just hit one of those language usage drift bits of confusion. "Sit in the corner" has a very negative tone to it in US English and I interpreted your message badly as a result. A Google search for images using that phrase will quickly show you what I mean. > What I meant was that your description of the "High Availability first > and foremost" is only one possible use case. Simon in the past > expressed the same single-minded focus on that use case. It's a > perfectly valid use case and I would probably agree if we had to > choose just one it would be the most important. > Sure, there are certainly others, and as much flexibility here as possible is a good thing. What I was suggesting is that if the only good way to handle long-running queries is to sacrifice high-availability, which is the situation if max_standby_delay is the approach you use, then the most obvious users for this feature are not being well served by that situation. I would guess a large portion of the users looking forward to Hot Standby are in the "have an underutilized high-availability standby I'd like to use for offloading long running reports" camp, and if there is no way to serve them well this feature is missing the mark a bit. You really can't do any better without better master/standby integration though, and as pointed out a couple of times here that was considered and just not followed through on yet. I'm increasingly concerned that nothing else will really do though. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Fri, 2010-02-26 at 12:02 -0800, Josh Berkus wrote: > > I don't see a "substantial additional burden" there. What I would > > imagine is needed is that the slave transmits a single number back > > --- its current oldest xmin --- and the walsender process publishes > > that number as its transaction xmin in its PGPROC entry on the master. > > If the main purpose of the slave is long-running queries, though, this > could cause a lot of bloat on the master. That's a special case, but a > reason why we would want to preserve the stop replication functionality. > Do we really think that users using the slave to run long-running queries is a special case? One of the number one things I can see this being used for is reporting.... Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use of Mr. or Sir.
On Sat, Feb 27, 2010 at 2:43 AM, Greg Smith <greg@2ndquadrant.com> wrote: > > But if you're running the 8 hour report on the master right now, aren't you > already exposed to a similar pile of bloat issues while it's going? If I > have the choice between "sometimes queries will get canceled" vs. "sometimes > the master will experience the same long-running transaction bloat issues as > in earlier versions even if the query runs on the standby", I feel like > leaning toward the latter at least leads to a problem people are used to. If they move from running these batch queries on the master to running them on the slave then sure, the situation will be no worse than before. But if they move from having a plain old PITR warm standby to having one they can run queries on they might well assume that the big advantage of having the standby to play with is precisely that they can do things there that they have never been able to do on the master previously without causing damage. I agree that having queries randomly and unpredictably canceled is pretty awful. My argument was that max_standby_delay should default to infinity on the basis that any other value has to be picked based on actual workloads and SLAs. My feeling is that there are probably only two types of configurations that make sense, a HA replica with a low max_standby_delay or a reporting replica with a high max_standby_delay. Any attempt to combine the two on the same system will only work if you know your application well and can make compromises with both. I would also like to be able to handle load balancing read-only queries but that will be really tricky. You want up-to-date data and you want to be able to run moderately complex queries. That kind of workload may well require synchronous replication to really work properly. -- greg
On Sat, 27 Feb 2010 00:43:48 +0000, Greg Stark <gsstark@mit.edu> wrote: > On Fri, Feb 26, 2010 at 11:56 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> This is also the reason why the whole "pause recovery" idea is a >> fruitless >> path to wander down. The whole point of this feature is that people >> have a >> secondary server available for high-availability, *first and foremost*, >> but >> they'd like it to do something more interesting that leave it idle all >> the >> time. The idea that you can hold off on applying standby updates for >> long >> enough to run seriously long reports is completely at odds with the idea >> of >> high-availability. > I want my ability to run large batch queries without any performance > or reliability impact on the primary server. +1 I can use any number of other technologies for high availability. Joshua D. Drake -- PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Fri, Feb 26, 2010 at 9:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: > >> What extra entries? > > Locks, just for starters. I haven't read enough of the code yet to know > what else Simon added. In the past it's not been necessary to record > any transient information in WAL, but now we'll have to. Haven't we been writing locks to the WAL since two-phase commit? -- greg
Joshua D. Drake wrote: > On Sat, 27 Feb 2010 00:43:48 +0000, Greg Stark <gsstark@mit.edu> wrote: > >> I want my ability to run large batch queries without any performance >> or reliability impact on the primary server. >> > > +1 > > I can use any number of other technologies for high availability. > Remove "must be an instant-on failover at the same time" from the requirements and you don't even need 9.0 to handle that; this has been a straightforward problem to solve since 8.2. It's the combination of HA and queries that makes things hard to do. If you just want batch queries on another system without being concerned about HA at the same time, the first option is to just fork the base backup and WAL segment delivery to another server and run queries there. Some simple filesystem snapshot techniques will also suffice to handle it all on the same standby. Stop warm standby recovery, snapshot, trigger the server, run your batch job; once finished, rollback to the snapshot, grab the latest segment files, and resume standby catchup. Even the lame Linux LVM snapshot features can handle that job--one of my coworkers even has the whole thing scripted, it's so common. And if you have to go live because there's a failover, you're back to the same "cold standby" situation a large max_standby_delay puts you in, so it's not even very different from what you're going to get in 9.0 if this is your priority mix. The new version is just lowering the operational complexity involved. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Stark wrote: > But if they move from having a plain old PITR warm standby to having > one they can run queries on they might well assume that the big > advantage of having the standby to play with is precisely that they > can do things there that they have never been able to do on the master > previously without causing damage. > Just not having the actual query running on the master is such a reduction in damage that I think it's delivering the essence of what people are looking for regardless. That it might be possible in some cases to additionally avoid the overhead that comes along with any long-running query is a nice bonus, and it's great the design allows for that possibility. But if that's only possible with risk, heavy tweaking, and possibly some hacks, I'm not sure that's making the right trade-offs for everyone. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
* Greg Smith <greg@2ndquadrant.com> [100226 23:39]: > Just not having the actual query running on the master is such a > reduction in damage that I think it's delivering the essence of what > people are looking for regardless. That it might be possible in some > cases to additionally avoid the overhead that comes along with any > long-running query is a nice bonus, and it's great the design allows for > that possibility. But if that's only possible with risk, heavy > tweaking, and possibly some hacks, I'm not sure that's making the right > trade-offs for everyone. Would we (ya, the royal we) be willing to say that if you want the benefit of removing the MVCC overhead of long-running queries you need to run PITR backup/archive recovery, and if you want SR, you get a closed-loop master-follows-slave-xmin behaviour? a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Aidan Van Dyk wrote: > Would we (ya, the royal we) be willing to say that if you want the > benifit of removing the MVCC overhead of long-running queries you need > to run PITR backup/archive recovery, and if you want SR, you get a > closed-loop master-follows-save-xmin behaviour? > To turn that question around a little, I think it's reasonable to say that closed-loop master-follows-slave-xmin behavior is only practical to consider implementing with SR--and even there, it should be optional rather than required until there's more field experience on the whole thing. Whether it's the default or not could take a bit of debate to sort out too. If you think of it in those terms, the idea that "you need to run PITR backup/archive recovery" to not get that behavior isn't an important distinction anymore. If you run SR with the option enabled you could get it, any other setup and you won't. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Stark wrote: > On Fri, Feb 26, 2010 at 9:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There's *definitely* not going to be enough information in the WAL >> stream coming from a master that doesn't think it has HS slaves. >> We can't afford to record all that extra stuff in installations for >> which it's just useless overhead. BTW, has anyone made any attempt >> to measure the performance hit that the patch in its current form is >> creating via added WAL entries? > > What extra entries?
* An xact-assignment record is written every PGPROC_MAX_CACHED_SUBXIDS (= 64) subtransaction ids assigned to a single top-level transaction.
* A running-xacts record is written at every online checkpoint.
* A btree-reuse-page record is written whenever a dead b-tree page is recycled.
* A vacuum cleanup-info record is written once per VACUUM of a table.
* A standby-lock record is written for each AccessExclusiveLock acquired.
Am I missing something? I doubt any of these are noticeable, though I don't think anyone has measured it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Josh Berkus wrote: >> That is exactly the core idea I was trying to suggest in my rambling >> message. Just that small additional bit of information transmitted and >> published to the master via that route, and it's possible to optimize >> this problem in a way not available now. And it's a way that I believe >> will feel more natural to some users who may not be well served by any >> of the existing tuning possibilities. > > Well, if both you and Tom think it would be relatively easy (or at least > easier that continuing to pursue query cancel troubleshooting), then > please start coding it. It was always a possible approach, we just > collectively thought that query cancel would be easier. You still need query cancels. A feedback loop just makes it happen less frequently. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Dimitri Fontaine wrote: > Bruce Momjian <bruce@momjian.us> writes: >> Doesn't the system already adjust the delay based on the length of slave >> transactions, e.g. max_standby_delay. It seems there is no need for a >> user switch --- just max_standby_delay really high. > > Well that GUC looks like it allows to set a compromise between HA and > reporting, not to say "do not ever give the priority to the replay while > I'm running my reports". At least that's how I understand it. max_standby_delay=-1 does that. The documentation needs to be updated to reflect that, it currently says: > There is no wait-forever setting because of the potential for deadlock which that setting would introduce. This parameter can only be set in the postgresql.conf file or on the server command line. but that is false, -1 means wait forever. Simon removed that option at one point, but it was later put back and apparently the documentation was never updated. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Dimitri Fontaine wrote: >> Bruce Momjian <bruce@momjian.us> writes: >>> Doesn't the system already adjust the delay based on the length of slave >>> transactions, e.g. max_standby_delay. It seems there is no need for a >>> user switch --- just max_standby_delay really high. >> Well that GUC looks like it allows to set a compromise between HA and >> reporting, not to say "do not ever give the priority to the replay while >> I'm running my reports". At least that's how I understand it. > > max_standby_delay=-1 does that. The documentation needs to be updated to > reflect that, it currently says: > >> There is no wait-forever setting because of the potential for deadlock which that setting would introduce. This parameter can only be set in the postgresql.conf file or on the server command line. > > but that is false, -1 means wait forever. Simon removed that option at > one point, but it was later put back and apparently the documentation > was never updated. I've put back the mention of the max_standby_delay=-1 option in the docs. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Greg Smith wrote: > Joshua D. Drake wrote: > > On Sat, 27 Feb 2010 00:43:48 +0000, Greg Stark <gsstark@mit.edu> wrote: > > > >> I want my ability to run large batch queries without any performance > >> or reliability impact on the primary server. > >> > > > > +1 > > > > I can use any number of other technologies for high availability. > > > > Remove "must be an instant-on failover at the same time" from the > requirements and you don't even need 9.0 to handle that, this has been a > straightforward to solve problem since 8.2. It's the combination of HA > and queries that make things hard to do. > > If you just want batch queries on another system without being concerned > about HA at the same time, the first option is to just fork the base > backup and WAL segment delivery to another server and run queries there. That is a lot of administrative overhead. It is hard to say it is equivalent to HS. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. +
> I think that what we are going to have to do before we can ship 9.0 > is rip all of that stuff out and replace it with the sort of closed-loop > synchronization Greg Smith is pushing. It will probably be several > months before everyone is forced to accept that, which is why 9.0 is > not going to ship this year. Oh, 9.0 will not be shipped in 2010? Do you guys share the same opinion as Tom? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Greg, > If you think of it in those terms, the idea that "you need to run PITR > backup/archive recovery" to not get that behavior isn't an important > distinction anymore. If you run SR with the option enabled you could > get it, any other setup and you won't. +1. I always expected that we'd get this kind of behavior with "synch" in 9.1. I can see that there are two desired modes of behaviour depending on what the replication config is: 1) Master full-speed, degraded operation on slaves: this is the current max_standby_delay approach. It has the advantage of supporting possibly hundreds of slaves, and certainly dozens. 2) Master burdened, full operation on slaves: this is the publish-xmin-back-to-master approach, which IIRC the core team first discussed at pgCon 2008 before Simon started work, and which you and Tom seem to think can be done soon. I can see people wanting to use either mode depending on their use-case. Or, for that matter, using both modes for different slaves. Now that I think about it, the xmin thing really doesn't seem conceptually difficult. If the slave just opens a 2nd, special query connection back to the master and publishes its oldest xmin there, as far as the master is concerned, it's just another query backend. Could it be that easy? Also, I'm serious about what I suggested earlier for "delay" mode. We should have an option for cancelled queries to be immediately retried, if that's feasible. It would turn something which is now a major application design issue (lots of cancelled queries) into just degraded performance. Overall, though, I'd say that getting 9.0 out the door relatively on-time is more important than getting it perfect. "Release early, release often" isn't just a mantra; it's a very good idea if you want your project to keep improving and not bog down and fall apart. --Josh Berkus
Josh Berkus wrote: > Now that I think about it, the xmin thing really doesn't seem > conceptually difficult. If the slave just opens a 2nd, special query > connection back to the master and publishes its oldest xmin there, as > far as the master is concerned, it's just another query backend. > Could it be that easy? > Something just like that is in fact already suggested as a workaround in the Hot Standby manual: "The first option is to connect to the primary server and keep a query active for as long as needed to run queries on the standby. This guarantees that a WAL cleanup record is never generated and query conflicts do not occur, as described above. This could be done using contrib/dblink and pg_sleep(), or via other mechanisms." And the idea of doing it mainly in client land has its attractions. The main reason I wandered toward asking about it in the context of SR is that there's already this open "Standby delay on idle system" issue with Hot Standby, and the suggested resolution for that problem involves publishing keep-alive data with timestamps over SR. While all these problems and potential solutions have been floating around for a long time, as you pointed out, the little flash of insight I had here was that it's possible to bundle these two problems together with a combined keep-alive timestamp+xmin message that goes in both directions. That removes one serious Hot Standby issue altogether, and adds an additional conflict avoidance mechanism for people who want to enable it, all with something that needs to get done sooner or later anyway for sync rep. The part I still don't have good visibility on is how much of the necessary SR infrastructure needed to support this communications channel is already available in some form. I had thought the walsender on the master was already receiving messages sometimes from the walreceiver on the standby, but I'm getting the impression from Heikki's comments that this is not the case at all yet. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
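For reference, a minimal sketch of what that documented dblink workaround looks like in practice. The connection name, connection string, and sleep length are placeholders, and it assumes contrib/dblink is installed wherever the controlling session runs:

-- Open a connection to the primary and park a query there; while it runs,
-- the primary's xmin can't advance past it, so VACUUM there won't remove
-- rows a standby query might still need.
SELECT dblink_connect('master_pin', 'host=master.example.com dbname=pgbench');
SELECT dblink_send_query('master_pin', 'SELECT pg_sleep(3600)');

-- ... run the long report on the standby ...

-- Then release the primary so cleanup can resume:
SELECT dblink_cancel_query('master_pin');
SELECT dblink_disconnect('master_pin');

As the follow-up messages point out, this is racy unless the pinning query on the master starts before the standby query takes its snapshot, and it costs one master connection per long-running standby job.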
> The part I still don't have good visibility on is how much of the > necessary SR infrastructure needed to support this communications > channel is already available in some form. I had thought the walsender > on the master was already receiving messages sometimes from the > walreceiver on the standby, but I'm getting the impression from Heikki's > comments that this is not the case at all yet. I don't think asking for a 2nd connection back from the standby to the master would be horrible for 9.0. I think it would be quite reasonable, actually; even with 2 connections per slave, you could still have quite a few slaves before the # of connections bogged down the master. --Josh
> Thank you for combining a small personal attack with a selfish > commentary about how yours is the only valid viewpoint. Saves me a lot > of trouble replying to your messages, can just ignore them instead if > this is how you're going to act. Hey, take it easy! I read Stark's post as tongue-in-cheek, which I think it was. Though, Stark, if you're going to be flip, I'd suggest using a smiley next time. --Josh
Josh Berkus wrote: > Hey, take it easy! I read Stark's post as tongue-in-cheek, which I > think it was. > Yeah, I didn't get that. We've already exchanged mutual off-list apologies for the misunderstanding in both directions, I stopped just short of sending flowers. I did kick off this discussion with noting a clear preference this not wander into any personal finger-pointing. And I am far too displeased with the technical situation here to have much of a sense of humor left about it either. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Fri, Feb 26, 2010 at 1:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> In the model you describe any long-lived queries on the slave cause >> tables in the master to bloat with dead records. > > Yup, same as they would do on the master. > >> I think this model is on the roadmap but it's not appropriate for >> everyone and I think one of the benefits of having delayed it is that >> it forces us to get the independent model right before throwing in >> extra complications. It would be too easy to rely on the slave >> feedback as an answer for hard questions about usability if we had it >> and just ignore the question of what to do when it's not the right >> solution for the user. > > I'm going to make an unvarnished assertion here. I believe that the > notion of synchronizing the WAL stream against slave queries is > fundamentally wrong and we will never be able to make it work. > The information needed isn't available in the log stream and can't be > made available without very large additions (and consequent performance > penalties). As we start getting actual beta testing we are going to > uncover all sorts of missed cases that are not going to be fixable > without piling additional ugly kluges on top of the ones Simon has > already crammed into the system. Performance and reliability will both > suffer. > > I think that what we are going to have to do before we can ship 9.0 > is rip all of that stuff out and replace it with the sort of closed-loop > synchronization Greg Smith is pushing. It will probably be several > months before everyone is forced to accept that, which is why 9.0 is > not going to ship this year. Somewhat unusually for me, I haven't been able to keep up with my email over the last few days, so I'm weighing in on this one a bit late. It seems to me that if we're forced to pass the xmin from the slave back to the master, that would be a huge step backward in terms of both scalability and performance, so I really hope it doesn't come to that. I wish I understood better exactly what you mean by "the notion of synchronizing the WAL stream against slave queries" and why you don't think it will work. Can you elaborate? ...Robert
Greg Smith wrote: > Josh Berkus wrote: > > > Now that I think about it, the xmin thing really doesn't seem > > conceptually difficult. If the slave just opens a 2nd, special query > > connection back to the master and publishes its oldest xmin there, as > > far as the master is concerned, it's just another query backend. > > Could it be that easy? > > > > Something just like that is in fact already suggested as a workaround in > the Hot Standby manual: > > "The first option is to connect to the primary server and keep a query > active for as long as needed to run queries on the standby. This > guarantees that a WAL cleanup record is never generated and query > conflicts do not occur, as described above. This could be done using > contrib/dblink and pg_sleep(), or via other mechanisms." I am unclear how you would easily advance the snapshot as each query completes on the slave. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
Bruce Momjian wrote: >> "The first option is to connect to the primary server and keep a query >> active for as long as needed to run queries on the standby. This >> guarantees that a WAL cleanup record is never generated and query >> conflicts do not occur, as described above. This could be done using >> contrib/dblink and pg_sleep(), or via other mechanisms." >> > > I am unclear how you would easily advance the snapshot as each query > completes on the slave. > The idea of the workaround is that if you have a single long-running query to execute, and you want to make sure it doesn't get canceled because of a vacuum cleanup, you just have it connect back to the master to keep an open snapshot the whole time. That's basically the same idea that vacuum_defer_cleanup_age implements, except you don't have to calculate a value--you just hold open the snapshot to do it. When that query ended, its snapshot would be removed, and then the master would advance to whatever the next latest one is. Nothing fancier than that. The only similarity is that if you made every query that happened on the standby do that, it would effectively be the same behavior I'm suggesting could be available via the standby->master xmin publication. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Sun, Feb 28, 2010 at 5:28 AM, Greg Smith <greg@2ndquadrant.com> wrote: > The idea of the workaround is that if you have a single long-running query > to execute, and you want to make sure it doesn't get canceled because of a > vacuum cleanup, you just have it connect back to the master to keep an open > snapshot the whole time. That's basically the same idea that > vacuum_defer_cleanup_age implements, except you don't have to calculate a > value--you just hold open the snapshot to do it. This sounds like it would require a separate connection for each client on the replica. That would be a pretty big burden for the master. Also, I'm not sure this actually works. When your client makes this additional connection to the master it's connecting at some transaction in the future from the slave's point of view. The master could have already vacuumed away some record which the snapshot the client gets on the slave will have in view. Even if you defer taking the snapshot on the slave until after connecting to the master it's still possibly "in the past" compared to the xmin on the master. I think to make this work you would have to connect to the master, establish a snapshot, then fetch pg_current_xlog_location(), then poll the slave and wait until it reaches that same position -- and only then perform your query taking care to establish a fresh snapshot for it such as by starting a new transaction on the slave. That's a lot of effort to go to. Still it's a handy practical trick even if it isn't 100% guaranteed to work. But I don't think it provides the basis for something we can bake in. -- greg
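Spelled out as a sketch, and assuming the 9.0-era function names, the careful version of that sequence looks roughly like the following, with the polling and comparison logic living in the client rather than in SQL:

-- 1. On the master: open a transaction, take a snapshot, and leave the
--    session idle; its snapshot holds the master's xmin back.
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT txid_current_snapshot();        -- any query here pins the snapshot

-- 2. Still on the master: note how far the WAL has gotten.
SELECT pg_current_xlog_location();

-- 3. On the standby: poll until replay has reached that location,
--    comparing the two values in the client.
SELECT pg_last_xlog_replay_location();

-- 4. Only then begin a fresh transaction on the standby and run the report,
--    keeping the master-side transaction from step 1 open until it finishes.

That ordering is what closes the race described above, at the price of one idle-in-transaction session on the master for the life of each report.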
Robert Haas wrote: > It seems to me that if we're forced to pass the xmin from the > slave back to the master, that would be a huge step backward in terms > of both scalability and performance, so I really hope it doesn't come > to that. Not forced to--have the option of. There are obviously workloads where you wouldn't want this. At the same time, I think there are some pretty common ones people are going to expect HS+SR to work on transparently where this would obviously be the preferred trade-off to make, were it available as one of the options. The test case I put together shows an intentionally pathological but not completely unrealistic example of such a workload. > I wish I understood better exactly what you mean by "the > notion of synchronizing the WAL stream against slave queries" and why > you don't think it will work. Can you elaborate? > There's this constant WAL stream coming in from the master to the slave. Each time the slave is about to apply a change from that stream, it considers "will this disrupt one of the queries I'm already executing?". If so, it has to make a decision about what to do; that's where the synchronization problem comes from. The current two options are "delay applying the change", at which point the master and standby will drift out of sync until the query ends and it can catch back up, or "cancel the query". There are tunables for each of these, and they all seem to work fine (albeit without too much testing in the field yet). My concern is that the tunable that tries to implement the other thing you might want to optimize for--"avoid letting the master generate WAL entries that are the most likely ones to conflict"--just isn't very usable in its current form. Tom and I don't see completely eye to eye on this, in that I'm not so sure the current behaviors are "fundamentally wrong and we will never be able to make [them] work". If that's really the case, you may not ever get the scalability/performance results you're hoping for from this release, and really we're all screwed if those are the only approaches available. What I am sure of is that a SR-based xmin passing approach is simpler, easier to explain, more robust for some common workloads, and less likely to give surprised "wow, I didn't think *that* would cancel my standby query" reports from the field than any way you can configure Hot Standby alone right now. And since I never like to bet against Tom's gut feel, having it around as a "plan B" in case he's right about an overwhelming round of bug reports piling up against the max_standby_delay etc. logic doesn't hurt either. I spent a little time today seeing if there was any interesting code I might steal from the early "synchrep" branch at http://git.postgresql.org/gitweb?p=users/fujii/postgres.git;a=summary , but sadly when I tried to rebase that against the master to separate out just the parts unique to it the merge conflicts were overwhelming. I hate getting beaten by merge bitrot even when Git is helping. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Stark wrote: > On Sun, Feb 28, 2010 at 5:28 AM, Greg Smith <greg@2ndquadrant.com> wrote: > >> The idea of the workaround is that if you have a single long-running query >> to execute, and you want to make sure it doesn't get canceled because of a >> vacuum cleanup, you just have it connect back to the master to keep an open >> snapshot the whole time. > Also, I'm not sure this actually works. When your client makes this > additional connection to the master it's connecting at some > transaction in the future from the slave's point of view. The master > could have already vacuumed away some record which the snapshot the > client gets on the slave will have in view. Right, and there was an additional comment in the docs alluding to some sleep time on the master that intends to try and improve things. If you knew how long archive_timeout was you could try to sleep longer than it to try and increase your odds of avoiding an ugly spot. But there are race conditions galore possible here, particularly if your archiver or standby catchup is backlogged. > Still it's a handy practical trick even if it isn't 100% guaranteed to > work. But I don't think it provides the basis for something we can > bake in. > Agreed on both counts, which is why it's in the current docs as a workaround people can consider, but not what I've been advocating as the right way to proceed. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Sun, Feb 28, 2010 at 6:07 AM, Greg Smith <greg@2ndquadrant.com> wrote: > Not forced to--have the option of. There are obviously workloads where you > wouldn't want this. At the same time, I think there are some pretty common > ones people are going to expect HS+SR to work on transparently where this > would obviously be the preferred trade-off to make, were it available as one > of the options. The test case I put together shows an intentionally > pathological but not completely unrealistic example of such a workload. Well if we're forced to eventually have both then it kind of takes the wind out of Tom's arguments. We had better get both features working so it becomes only a question of which is worth doing first and which can be held off. Since there aren't any actual bugs in evidence for the current setup and we already have it that's a pretty easy decision. > What I am sure of is that a SR-based xmin passing approach is simpler, > easier to explain, more robust for some common workloads, and less likely to > give surprised "wow, I didn't think *that* would cancel my standby query" > reports from the field Really? I think we get lots of surprised wows from the field from the idea that a long-running read-only query can cause your database to bloat. I think the only reason that's obvious to us is that we've been grappling with that problem for so long. > And since I never like to bet against Tom's gut feel, having it > around as a "plan B" in case he's right about an overwhelming round of bug > reports piling up against the max_standby_delay etc. logic doesn't hurt > either. Agreed. Though I think it'll be bad in that case even if we have a plan B. It'll mean no file-based log shipping replicas and no guarantee that what you run on the standby can't affect the master -- which is a pretty nice guarantee. It'll also mean it'll be much more fragile against network interruptions. -- greg
On Sun, Feb 28, 2010 at 2:54 PM, Greg Stark <gsstark@mit.edu> wrote: > Really? I think we get lots of suprised wows from the field from the > idea that a long-running read-only query can cause your database to > bloat. I think the only reason that's obvious to us is that we've been > grappling with that problem for so long. It seems to me that the scenario that you are looking at is one where people run different queries with and without HS, i.e. that they will run longer read-only queries than now once they have HS. I don't think that is the case. If it isn't you cannot really speak of a master "bloat". Instead, I assume that most people who will grab 9.0 and use HS+SR do already have a database with a certain query profile. Now with HS+SR they will try to put the most costly and longest read-only queries to the standby but in the end will run the same number of queries with the same overall complexity. Now let's take a look at both scenarios from the administrators' point of view: 1) With the current implementation they will see better performance on the master and more aggressive vacuum (!), since they have less long-running queries now on the master and autovacuum can kick in and clean up with less delay than before. On the other hand their queries on the standby might fail and they will start thinking that this HS+SR feature is not as convincing as they thought it was... Next step for them is to take the documentation and study it for a few days to learn all about vacuum, different delays, transaction ids and age parameters and experiment a few weeks until no more queries fail - for a while... But they can never be sure... In the end they might also modify the parameters in the wrong direction or overshoot because of lack of time to experiment and lose another important property without noticing (like being as close as possible to the master). 2) On the other hand if we could ship 9.0 with the xmin-propagation feature, people would still see a better performance and have a hot standby system but this time without query cancellations. Again: the read-only queries that will be processed by the HS in the future are being processed by the master today anyway, so why should it get worse? The first impression will be that it just works nicely out of the box, is easy to set up and has no negative effect (query cancellation) that has not already shown up before (vacuum lag). I guess that most people will just run fine with this setup and never get to know about the internals. Of course we should still offer an expert mode where you can turn all kinds of knobs and where you can avoid the vacuum dependency but it would be nice if this could be the expert mode only. Tuning this is highly installation specific and you need to have a deep understanding of how PostgreSQL and HS work internally and what you actually want to achieve... > Agreed. Though I think it'll be bad in that case even if we have a > plan B. It'll mean no file-based log shipping replicas and no > guarantee that what you run on the standby can't affect the master -- > which is a pretty nice guarantee. It'll also mean it'll be much more > fragile against network interruptions. Regarding the network interruptions... in reality if you have network interruptions of several minutes between your primary and your standby, you have worse problems anyway... If the standby does not renew its xmin for n seconds, log a message and just go on... Joachim
All, First, from the nature of the arguments, we need to eventually have both versions of SR: delay-based and xmin-pub. And it would be fantastic if Greg Smith and Tom Lane could work on xmin-pub to see if we can get it ready as well. I also think, based on the discussion and Greg's test case, that we could do two things which would make the shortcomings of delay-based SR a vastly better experience for users: 1) Automated retry of cancelled queries on the slave. I have no idea how hard this would be to implement, but it makes the difference between writing lots of exception-handling code for slave connections (unacceptable) and just slow response times on the slave (acceptable). 2) A more usable vacuum_defer_cleanup_age. If it was feasible for a user to configure the master to not vacuum records less than, say, 5 minutes dead, then that would again offer the choice to the user of slightly degraded performance on the master (acceptable) vs. lots of query cancel (unacceptable). I'm going to test Greg's case with vacuum_defer_cleanup_age used fairly liberally to see if this approach has merit. Why do I say that "lots of query cancel" is "unacceptable"? For the simple reason that one cannot run the same application code against an HS+SR cluster with lots of query cancel as one runs against a standalone database. And if that's true, then the main advantage of HS+SR over Slony and Londiste is gone. MySQL took great pains to make sure that you could run the same code against replicated MySQL as standalone, and that was based on having a fairly intimate relationship with their users (at the time, anyway). Another thing to keep in mind in these discussions is the inexpensiveness of servers today. This means that, if slaves have poor performance, that's OK; one can always spin up more slaves. But if each slave imposes a large burden on the master, then that limits your scalability. --Josh Berkus
Joachim Wieland wrote: > Instead, I assume that most people who will grab 9.0 and use HS+SR do > already have a database with a certain query profile. Now with HS+SR > they will try to put the most costly and longest read-only queries to > the standby but in the end will run the same number of queries with > the same overall complexity. > This is a nice summary of the primary use-case I am trying to optimize usability for, because I know for a fact there's a stack of pent-up requests for exactly this form of improvement from existing warm standby users. And your subsequent discussion of how administrators will react in each of the possible configurations here matches my own concerns. I would highly recommend anyone who feels this is not a critical feature to fix carefully read Joachim's message from an advocacy perspective, that's a better user-oriented prediction than mine of exactly how this is going to play out in the field post-release. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Josh Berkus wrote: > First, from the nature of the arguments, we need to eventually have both > versions of SR: delay-based and xmin-pub. And it would be fantastic if > Greg Smith and Tom Lane could work on xmin-pub to see if we can get it > ready as well. > As I see it, the main technical obstacle here is that a subset of a feature already on the SR roadmap needs to get built earlier than expected to pull this off. I don't know about Tom, but I have no expectation it's possible for me to get up to speed on that code fast enough to contribute anything there. I expect the thing I'd be most productive at as far as moving the release forward is to continue testing this pair of features looking for rough edges, which is what I have planned for the next month. I'm not even close to finished with generating test cases specifically probing for bad behavior suspected after a look at the implementation details--this is just what I came up with in my first week of that. Count me in for more testing, but out for significant development here. It's not what I've got my time allocated for because it's not where I think I'll be most productive. > 2) A more usable vacuum_defer_cleanup_age. If it was feasible for a > user to configure the master to not vacuum records less than, say, 5 > minutes dead, then that would again offer the choice to the user of > slightly degraded performance on the master (acceptable) vs. lots of > query cancel (unacceptable). I'm going to test Greg's case with > vacuum_defer_cleanup_age used fairly liberally to see if this approach has merit. > I've been down that road and it leads quickly to the following question: "how can I tell how old in time-based units an xid is?" If there were an easy answer to that question, vacuum_defer_cleanup_age would already be set in time units. It's the obvious UI to want, it's just not obvious how to build it internally. Maybe I missed something, but my guess is that vacuum_defer_cleanup_age is already as good as it's going to get. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Josh Berkus <josh@agliodbs.com> writes: > 2) A more usable vacuum_defer_cleanup_age. If it was feasible for a > user to configure the master to not vacuum records less than, say, 5 > minutes dead, then that would again offer the choice to the user of > slightly degraded performance on the master (acceptable) vs. lots of > query cancel (unacceptable). I'm going to test Greg's case with > vacuum_cleanup_age used fairly liberally to see if this approach has > merit. I think that to associate any time based interval notion with the XID flow, you need a ticker. We already took the txid and txid_snapshot types and functions from Skytools, which took them from Slony. Maybe we could consider borrowing pgqd, the C version of the ticker, for being able to specify in human time how long a dead transaction is allowed to remain in the heap? http://github.com/markokr/skytools-dev/tree/master/sql/ticker/ Regards, -- dim
On Sun, Feb 28, 2010 at 8:47 PM, Josh Berkus <josh@agliodbs.com> wrote: > 1) Automated retry of cancelled queries on the slave. I have no idea > how hard this would be to implement, but it makes the difference between > writing lots of exception-handling code for slave connections > (unacceptable) to just slow response times on the slave (acceptable). We're not only canceling queries, we are effectively canceling transactions. It seems quite impossible to repeat all queries from a transaction that has started in the past. One query might be or include the result of a previous query and as the data we see now has changed since then, the client might now want to execute a different query when it gets a different result out of a previous query... And even if it was possible, how often would you retry? You still have no guarantee that your query succeeds the second time. I'd claim that if a query failed once, chances are even higher that it fails again than that it succeeds the second time. Moreover if you continue to repeat the query and if queries come in at a certain rate, you need to process more and more queries on the slave which will not really help other queries to finish in time nor will it be beneficial for the throughput of the system as a whole... I fully agree with what you say about user expectations: We need to assume that many programs are not prepared for failures of "simple" read-only queries because in the past they have always worked... > Another thing to keep in mind in these discussions is the > inexpensiveness of servers today. This means that, if slaves have poor > performance, that's OK; one can always spin up more slaves. But if each > slave imposes a large burden on the master, then that limits your > scalability. The burden of the xmin-publication feature is not the number of slaves, it's just the longest running queries on whatever slave they are. So your argument applies to both cases... To minimize the burden on the master, get additional slaves so that you can run your most expensive queries in a shorter time :-) Joachim
On Fri, 2010-02-26 at 16:44 -0500, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > On Fri, Feb 26, 2010 at 9:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> There's *definitely* not going to be enough information in the WAL > >> stream coming from a master that doesn't think it has HS slaves. > >> We can't afford to record all that extra stuff in installations for > >> which it's just useless overhead. BTW, has anyone made any attempt > >> to measure the performance hit that the patch in its current form is > >> creating via added WAL entries? > > > What extra entries? > > Locks, just for starters. I haven't read enough of the code yet to know > what else Simon added. In the past it's not been necessary to record > any transient information in WAL, but now we'll have to. There is room for technical confusion here, so I'll just add some info. There was/is potential for performance hit because of the volume of additional WAL *and* the processing overhead from that additional WAL. As Heikki points out these turn out to be minimal, though this has been by careful design. There is also potential for a performance hit because incoming cleanup records may conflict with currently executing queries. If we knew for certain that the master was not sending any cleanup records that would affect current standby queries we could avoid that overhead altogether. That's a good reason for integrating a solution. AccessExclusiveLock lock records are nothing at all to do with that. They exist simply to prevent obvious correctness issues such as somebody reading a file while it is being deleted. -- Simon Riggs www.2ndQuadrant.com
On Fri, 2010-02-26 at 03:33 -0500, Greg Smith wrote: > I really hope this discussion can stay focused on if and how it's > possible to improve this area, with the goal being to deliver a product > everyone can be proud of with the full feature set that makes this next > release a killer one. The features that have managed to all get into > this release already are fantastic, everyone who contributed should be > proud of that progress, and it's encouraging that the alpha4 date was > nailed. It would be easy to descend into finger-pointing for why > exactly this particular problem is only getting more visibility now, or > into schedule-oriented commentary suggesting it must be ignored because > it's too late to do anything about it. I hope everyone appreciates > wandering that way will not help make PostgreSQL 9.0 a better release. > This issue is so easy to encounter, and looks so bad when it happens, > that I feel it could easily lead to an embarrassing situation for the > community if something isn't done about it before release. Thanks Greg. It's a great relief for me to hear someone else say this and to watch a discussion about this important issue unfold. -- Simon Riggs www.2ndQuadrant.com
Greg, Joachim, > As I see it, the main technical obstacle here is that a subset of a > feature already on the SR roadmap needs to get built earlier than > expected to pull this off. I don't know about Tom, but I have no > expectation it's possible for me to get up to speed on that code fast > enough to contribute anything there. I expect the thing I'd be most > productive at as far as moving the release forward is to continue > testing this pair of features looking for rough edges, which is what I > have planned for the next month. That's OK with me. I thought you were saying that xmin-pub was going to be easier than expected. Per my other e-mails, I think that we should be shooting for "good enough, on time" for 9.0., rather than "perfect". We can't ever get to "perfect" if we don't release software. Quite frankly, simply telling people that "long-running queries on the slave tend not to be effective, wait for 9.1" is a possibility. If you consider the limitations and silent failures associated with MySQL replication, let alone the issues with other Postgres solutions or the replication of some of the nosql databases, "no long-running queries" is a positively straightforward restriction. HS+SR is still a tremendous improvement over the options available previously. We never thought it was going to work for everyone everywhere, and shouldn't let our project's OCD tendencies run away from us. > I've been down that road and it leads quickly to the following > question: "how can I tell how old in time-based units an xid is?" If > there were an easy answer to that question, vacuum_defer_cleanup_age > would already be set in time units. It's the obvious UI to want, it's > just not obvious how to build it internally. Maybe I missed something, > but my guess is that vacuum_defer_cleanup_age is already as good as it's > going to get. Well, we could throw this on the user if we could get them some information on how to calculate that number. For example, some way for them to calculate the number of XIDs per minute via a query, and then set vacuum_defer_cleanup_age appropriately on the master. Sure, it's clunky, but we've already warned people that 9.0 will be clunky and hard to administer. And it's no worse than setting FSM_pages used to be. However, first we need to test that setting vacuum_defer_cleanup_age actually benefits query cancel issues. > We're not only canceling queries, we are effectively canceling > transactions. It seems quite impossible to repeat all queries from a > transaction that has started in the past. One query might be or > include the result of a previous query and as the data we see now has > changed since then, the client might now want to execute a different > query when it gets a different result out of a previous query... Sure, except that I don't expect people to be using explicit transactions as much on the slaves, since they are read-only anyway and can't even create temp tables. So having the retry not retry if there is an explicit transaction would be an OK option. > And even if it was possible, how often would you retry? You still have > no guarantee that your query succeeds the second time. I'd claim that > if a query failed once, chances are even higher that it fails again > than that it succeeds the second time.
> Moreover if you continue to > repeat the query and if queries come in at a certain rate, you need to > process more and more queries on the slave which will not really help > other queries to finish in time nor will it be beneficial for the > throughput of the system as a whole... Well, we'd need to have a limited number of retries, which means a GUC in recovery.conf: query_cancel_retry = # This might default to, say, 2. However, I'd still like to hear from someone with the requisite technical knowledge whether capturing and retrying the current query in a query cancel is even possible. --Josh Berkus
Josh Berkus wrote: > Well, we could throw this on the user if we could get them some > information on how to calculate that number. For example, some way for > them to calculate the number of XIDs per minute via a query, and then > set vacuum_defer_cleanup_age appropriately on the master. Sure, it's > clunky, but we've already warned people that 9.0 will be clunky and hard > to administer. And it's no worse than setting FSM_pages used to be. > > However, first we need to test that setting vacuum_defer_cleanup_age > actually benefits query cancel issues. > Proving that setting works as expected is already on my test case grid, seems fine in my limited testing so far. I've started looking into ways to monitor XID churn for setting it better. I'll take care of providing all that in my next test case update. My intent here is to take the ideas outlined in my "Hot Standby Tradeoffs" blog post and turn that into a new documentation section making it more clear where the problem steps are, regardless of what else happens here. And I need some concrete example of XID burn rate measurement to finish that job. The main problem with setting vacuum_defer_cleanup_age high isn't showing it works, it's a pretty simple bit of code. It's when you recognize that it penalizes all cleanup all the time, whether or not the standby is actually executing a long-running query, that you note the second level of pain in increasing it. Returning to the idea of "how is this different from a site already in production?", it may very well be the case that a site that sets vacuum_defer_cleanup_age high enough to support off-peak batch reporting cannot tolerate how that will impact vacuums during their peak time of day. The XID export implementation sidesteps that issue by only making the vacuum delay increase when queries that require it are running, turning this back into a standard "what's the best time of day to run my big reports?" issue that people understand how to cope with already. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
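Since nothing in the server exposes an XID-per-minute rate directly, the burn rate measurement can only be a sketch along these lines, using the existing txid_current() function and manual arithmetic; the numbers shown are made up for illustration:

    -- on the master: sample the XID counter, wait, sample again
    SELECT txid_current();   -- suppose this returns 1234567
    SELECT pg_sleep(60);     -- wait one minute
    SELECT txid_current();   -- suppose this returns 1240567
    -- ~6000 XIDs consumed per minute in this hypothetical case, so deferring
    -- five minutes of cleanup would mean roughly vacuum_defer_cleanup_age = 30000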
On Sun, Feb 28, 2010 at 5:38 PM, Josh Berkus <josh@agliodbs.com> wrote: > Greg, Joachim, > >> As I see it, the main technical obstacle here is that a subset of a >> feature already on the SR roadmap needs to get built earlier than >> expected to pull this off. I don't know about Tom, but I have no >> expectation it's possible for me to get up to speed on that code fast >> enough to contribute anything there. I expect the thing I'd be most >> productive at as far as moving the release forward is to continue >> testing this pair of features looking for rough edges, which is what I >> have planned for the next month. > > That's OK with me. I thought you were saying that xmin-pub was going to > be easier than expected. Per my other e-mails, I think that we should > be shooting for "good enough, on time" for 9.0., rather than "perfect". > We can't ever get to "perfect" if we don't release software. I agree. It seems to me that the right long term fix for the problem of query cancellations on the slave is going to be to give the slave the ability to save multiple versions of relation pages where necessary so that older snapshots can continue to be used even after the conflicting WAL has been applied. However, I'm pretty sure that's going to be a very difficult project which is unlikely to be coded by anyone any time soon, let alone merged. Until it does, we're going to force people to pick from a fairly unappealing menu of options: postpone WAL replay for long periods of time, cancel queries (perhaps even seemingly unrelated to what changed on the master), bloat the master. All of those options are seriously unpleasant. I think, though, that we have to think of this as being like the Windows port, or maybe even more significant than that, as an architectural change. I think it is going to take several releases for this feature to be well-understood and stable and have all the options we'd like it to have. It wouldn't surprise me if we get to 10.0 before we really have truly seamless replication. I don't expect Slony or Londiste or any of the other solutions that are out there now to get kicked to the curb by PG 9.0. Still, a journey of a thousand miles begins with the first step. Simon and many others have put a great deal of time and energy into getting us to the point where we are now, and if we let the fact that we haven't reached our ultimate goal keep us from putting what we have out there in front of our customers, I think we're going to regret that. I think the thing to do is to reposition our PR around these features. We should maybe even go so far as to call them "beta" or "experimental". We shouldn't tell people - this is going to be totally awesome. We should tell people - this is a big improvement, and it's still got some pretty significant limitations, but it's good stuff and it's going in a good direction. Overhyping what we have today is not going to be good for the project, and I'm frankly quite afraid that nothing we can possibly code between now and the release is going to measure up to what people are hoping for. We need to set our own expectations, and those of our customers, at a level at which they can be met. > Quite frankly, simply telling people that "long-running queries on the > slave tend not to be effective, wait for 9.1" is a possibility. Yep. > HS+SR is still a tremendous improvement over the options available > previously. We never thought it was going to work for everyone > everywhere, and shouldn't let our project's OCD tendencies run away from us. Yep.
> However, I'd still like to hear from someone with the requisite > technical knowledge whether capturing and retrying the current query in > a query cancel is even possible. I'm not sure who you want to hear from here, but I think that's a dead end. ...Robert
On 2/28/10 7:00 PM, Greg Smith wrote: > The main problem with setting vacuum_defer_cleanup_age high isn't > showing it works, it's a pretty simple bit of code. It's when you > recognize that it penalizes all cleanup all the time, whether or not the > standby is actually executing a long-running query or not, that you note > the second level of pain in increasing it. Returning to the idea of > "how is this different from a site already in production?", it may very > well be the case that a site that sets vacuum_defer_cleanup_age high > enough to support off-peak batch reporting cannot tolerate how that will > impact vacuums during their peak time of day. The XID export > implementation sidesteps that issue by only making the vacuum delay > increase when queries that require it are running, turning this back > into a standard "what's the best time of day to run my big reports?" > issue that people understand how to cope with already. I don't think that defer_cleanup_age is a long-term solution. But we need *a* solution which does not involve delaying 9.0. And I think we can measure bloat in a pgbench test, no? When I get a chance, I'll run one for a couple hours and see the difference that cleanup_age makes. --Josh Berkus
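For what it's worth, a crude way to watch the bloat side of such a test is to track on-disk size and dead tuples between runs with the standard statistics views -- a sketch only:

    SELECT c.relname,
           pg_size_pretty(pg_relation_size(c.oid)) AS on_disk,
           s.n_live_tup, s.n_dead_tup
      FROM pg_class c
      JOIN pg_stat_user_tables s ON s.relid = c.oid
     WHERE c.relname IN ('pgbench_accounts', 'pgbench_tellers');

Comparing those numbers with and without vacuum_defer_cleanup_age set should show how much extra dead space the deferral costs on the master.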
Josh Berkus wrote: > And I think we can measure bloat in a pgbench test, no? When I get a > chance, I'll run one for a couple hours and see the difference that > cleanup_age makes. > The test case I attached at the start of this thread runs just the UPDATE to the tellers table. Running something similar that focuses just on UPDATEs to the pgbench_accounts table, without the rest of the steps done by the standard test, is the fastest route to bloat. The standard test will do it too, just does a lot of extra stuff too that doesn't impact results (SELECT, INSERT) so it wastes some resources compared to a targeted bloater script. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
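A targeted bloater along those lines is essentially just the UPDATE portion of the standard TPC-B script run through pgbench's custom-script mode; something like the following, where the file name and run length are only illustrative and :scale is assumed to be set by pgbench the same way it is for the built-in test:

    -- update_accounts.sql
    \set naccounts 100000 * :scale
    \setrandom aid 1 :naccounts
    \setrandom delta -5000 5000
    UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;

Run it against the master with something like "pgbench -n -f update_accounts.sql -c 8 -T 600 pgbench" while the long SELECTs loop on the standby; -n matters because there is no point having pgbench vacuum away the evidence at the start of a bloat test.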
On Mon, Mar 1, 2010 at 5:50 PM, Josh Berkus <josh@agliodbs.com> wrote: > I don't think that defer_cleanup_age is a long-term solution. But we > need *a* solution which does not involve delaying 9.0. So I think the primary solution currently is to raise max_standby_age. However there is a concern with max_standby_age. If you set it to, say, 300s. Then run a 300s query on the slave which causes the slave to fall 299s behind. Now you start a new query on the slave -- it gets a snapshot based on the point in time that the slave is currently at. If it hits a conflict it will only have 1s to finish before the conflict causes the query to be cancelled. In short in the current setup I think there is no safe value of max_standby_age which will prevent query cancellations short of -1. If the slave has a constant stream of queries and always has at least one concurrent query running then it's possible that the slave will run continuously max_standby_age-epsilon behind the master and cancel queries left and right, regardless of how large max_standby_age is. To resolve this I think you would have to introduce some chance for the slave to catch up. Something like refusing to use a snapshot older than max_standby_age/2 and instead wait until the existing queries finish and the slave gets a chance to catch up and see a more recent snapshot. The problem is that this would result in very unpredictable and variable response times from the slave. A single long-lived query could cause replay to pause for a big chunk of max_standby_age and prevent any new query from starting. Does anyone see any way to guarantee that the slave gets a chance to replay and new snapshots will become visible without freezing out new queries for extended periods of time? -- greg
> So I think the primary solution currently is to raise max_standby_age. > > However there is a concern with max_standby_age. If you set it to, > say, 300s. Then run a 300s query on the slave which causes the slave > to fall 299s behind. Now you start a new query on the slave -- it gets > a snapshot based on the point in time that the slave is currently at. > If it hits a conflict it will only have 1s to finish before the > conflict causes the query to be cancelled. Completely aside from that, how many users are going to be happy with a slave server which is constantly 5 minutes behind? --Josh Berkus
On Mon, Mar 1, 2010 at 7:21 PM, Josh Berkus <josh@agliodbs.com> wrote: > Completely aside from that, how many users are going to be happy with a > slave server which is constantly 5 minutes behind? > Uhm, well all the ones who are happy with our current warm standby setup for one? And all the ones who are looking for a standby reporting server rather than a high availability DR site. For what it's worth Oracle has an option to have your standby intentionally hold back n minutes behind and I've seen that set to 5 minutes. -- greg
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Stefan Kaltenbrunner
Date:
Greg Stark wrote: > On Mon, Mar 1, 2010 at 7:21 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Completely aside from that, how many users are going to be happy with a >> slave server which is constantly 5 minutes behind? >> > > Uhm, well all the ones who are happy with our current warm standby > setup for one? > > And all the ones who are looking for a standby reporting server rather > than a high availability DR site. > > For what it's worth Oracle has an option to have your standby > intentionally hold back n minutes behind and I've seen that set to 5 > minutes. yeah a lot of people are doing that intentionally... Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > Greg Stark wrote: >> For what it's worth Oracle has an option to have your standby >> intentionally hold back n minutes behind and I've seen that set to 5 >> minutes. > yeah a lot of people are doing that intentionally... It's the old DBA screwup safety valve ... drop the main accounts table, you have five minutes to stop replication before it's dropped on the standby. Speaking of which, does the current HS+SR code have a provision to force the standby to stop tracking WAL and come up live, even when there's more WAL available? Because that's what you'd need in order for such a thing to be helpful in that scenario. regards, tom lane
On 3/1/10 11:43 AM, Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> Greg Stark wrote: >>> For what it's worth Oracle has an option to have your standby >>> intentionally hold back n minutes behind and I've seen that set to 5 >>> minutes. > >> yeah a lot of people are doing that intentionally... > > It's the old DBA screwup safety valve ... drop the main accounts table, > you have five minutes to stop replication before it's dropped on the > standby. Speaking of which, does the current HS+SR code have a > provision to force the standby to stop tracking WAL and come up live, > even when there's more WAL available? Because that's what you'd need > in order for such a thing to be helpful in that scenario. the "fast" recovery option should do this. You'd need some fast reaction times, though. However, this leaves aside Greg's point about snapshot age and successive queries; does anyone dispute his analysis? Simon? --Josh Berkus
On 2/28/10 7:12 PM, Robert Haas wrote: >> However, I'd still like to hear from someone with the requisite >> > technical knowledge whether capturing and retrying the current query in >> > a query cancel is even possible. > > I'm not sure who you want to hear from here, but I think that's a dead end. "dead end" as in "too hard to implement"? Or for some other reason? It's undeniable that auto-retry would be better from a user's perspective than a user-visible cancel. So if it's *reasonable* to implement, I think we should be working on it. I'm also very puzzled as to why nobody else wants to even discuss it; it's like some weird blackout. --Josh Berkus
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
"Kevin Grittner"
Date:
Josh Berkus <josh@agliodbs.com> wrote: > It's undeniable that auto-retry would be better from a user's > perspective than a user-visible cancel. So if it's *reasonable* > to implement, I think we should be working on it. I'm also very > puzzled as to why nobody else wants to even discuss it; it's like > some weird blackout. Well, at least for serializable transactions past the first statement, you'd need to have the complete *logic* for the transaction in order to do a retry. Not that this is a bad idea -- our application framework does this automatically -- but unless you only support this for a transaction which is wrapped up as a function, I don't see how the database itself could handle it. It might be *possible* to do it outside of a single-function transaction in a read committed transaction, but you'd have to be careful about locks. I remember suggesting automatic query retry (rather than continuing in a mixed-snapshot mode) for update conflicts in read committed mode and Tom had objections; you might want to check the archives for that. -Kevin
josh, nobody is talking about it because it doesn't make sense. you could only retry if it was the first query in the transaction and only if you could prove there were no side-effects outside the database and then you would have no reason to think the retry would be any more likely to work. greg On 1 Mar 2010 22:32, "Josh Berkus" <josh@agliodbs.com> wrote: > On 2/28/10 7:12 PM, Robert Haas wrote: >> However, I'd still like to hear from someone with the requ... > "dead end" as in "too hard to implement"? Or for some other reason? > It's undeniable that auto-retry would be better from a user's > perspective than a user-visible cancel. So if it's *reasonable* to > implement, I think we should be working on it. I'm also very puzzled as > to why nobody else wants to even discuss it; it's like some weird blackout. > --Josh Berkus
Greg Stark <stark@mit.edu> writes: > josh, nobody is talking about it because it doesn't make sense. you could > only retry if it was the first query in the transaction and only if you > could prove there were no side-effects outside the database and then you > would have no reason to think the retry would be any more likely to work. But it's hot standby, so there are no data-modifying transactions. Volatile functions could be a problem, though. A bigger problem is we might have already shipped partial query results to the client. I agree it ain't easy, but it might not be completely out of the question. Definitely won't be happening for 9.0 though. regards, tom lane
Josh Berkus wrote: > However, this leaves aside Greg's point about snapshot age and > successive queries; does anyone dispute his analysis? Simon? > There's already a note on the Hot Standby TODO about unexpectedly bad max_standby_delay behavior being possible on an idle system, with no suggested resolution for it besides better SR integration. The issue Greg Stark has noted is another variation on that theme. It's already on my list of theorized pathological but as yet undemonstrated concerns that Simon and I identified, the one I'm working through creating test cases to prove/disprove. I'm past "it's possible..." talks at this point, though, so as not to spook anyone unnecessarily, and am only raising things I can show concrete examples of in action. White box testing at some point does require pausing one's investigation of what's in the box and getting on with the actual testing instead. The only real spot where my opinion diverges here is that I have yet to find any situation where 'max_standby_delay=-1' makes any sense to me. When I try running my test cases with that setting, the whole system just reacts far too strangely. My first patch here is probably going to be adding more visibility into the situation when queries are blocking replication forever, because I think the times I find myself at "why is the system hung right now?" are when that happens and it's not obvious as an admin what's going on. Also, the idea that a long running query on the standby could cause an unbounded delay in replication is so foreign to my sensibilities that I don't ever include it in the list of useful solutions to the problems I'm worried about. The option is there, not disputing that it makes sense for some people because there seems some demand for it, just can't see how it fits into any of the use-cases I'm concerned about. I haven't said anything about query retry mainly because I can't imagine any way it's possible to build it in time for this release, so whether it's eventually feasible or not doesn't enter into what I'm worried about right now. In any case, I would prioritize that behind work on preventing the most common situations that cause cancellations in the first place, until those are handled so well that retry is the most effective improvement left to consider. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Mon, Mar 1, 2010 at 5:32 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 2/28/10 7:12 PM, Robert Haas wrote: >>> However, I'd still like to hear from someone with the requisite >>> > technical knowledge whether capturing and retrying the current query in >>> > a query cancel is even possible. >> >> I'm not sure who you want to hear from here, but I think that's a dead end. > > "dead end" as in "too hard to implement"? Or for some other reason? I think it's probably too hard to implement for the extremely limited set of circumstances in which it can work. See the other responses for some of the problems. There are others, too. Suppose that the plan for some particular query is to read a table with a hundred million records, sort it, and then do whatever with the results. After reading the first 99 million records, the transaction is cancelled and we have to start over. Maybe someone will say, fine, no problem - but it's certainly going to be user-visible. Especially if we retry more than once. I think we should focus our efforts initially on reducing the frequency of spurious cancels. What we're essentially trying to do here is refute the proposition "the WAL record I just replayed might change the result of this query". It's possibly equivalent to the halting problem (and certainly impossibly hard) to refute this proposition in every case where it is in fact false, but it sounds like what we have in place right now doesn't come close to doing as well as can be done. I just read through the current documentation and it doesn't really seem to explain very much about how HS decides which queries to kill. Can someone try to flesh that out a bit? It also uses the term "buffer cleanup lock", which doesn't seem to be used anywhere else in the documentation (though it does appear in the source tree, including README.HOT). ...Robert
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Aidan Van Dyk
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [100301 20:04]: > Greg Stark <stark@mit.edu> writes: > > josh, nobody is talking about it because it doesn't make sense. you could > > only retry if it was the first query in the transaction and only if you > > could prove there were no side-effects outside the database and then you > > would have no reason to think the retry would be any more likely to work. > > But it's hot standby, so there are no data-modifying transactions. > Volatile functions could be a problem, though. A bigger problem is > we might have already shipped partial query results to the client. But, since we know it's a slave and that the reason the query was cancelled was because it's got a backlog of updates to apply, it's very likely that the data that the earlier parts of the transaction saw would be different... And then you have no idea if just blindly replaying all statements of the transaction successively is a good idea... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Bruce Momjian
Date:
Joachim Wieland wrote: > 1) With the current implementation they will see better performance on > the master and more aggressive vacuum (!), since they have less > long-running queries now on the master and autovacuum can kick in and > clean up with less delay than before. On the other hand their queries > on the standby might fail and they will start thinking that this HS+SR > feature is not as convincing as they thought it was... Next step for > them is to take the documentation and study it for a few days to learn > all about vacuum, different delays, transaction ids and age parameters > and experiment a few weeks until no more queries fail - for a while... > But they can never be sure... In the end they might also modify the > parameters in the wrong direction or overshoot because of lack of time > to experiment and lose another important property without noticing > (like being as close as possible to the master). I assumed they would set max_standby_delay = -1 and be happy. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Bruce Momjian
Date:
Josh Berkus wrote: > HS+SR is still a tremendous improvement over the options available > previously. We never thought it was going to work for everyone > everywhere, and shouldn't let our project's OCD tendencies run away from us. OCD (Obsessive-Compulsive Disorder) --- good one. :-) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
Bruce Momjian wrote: > Joachim Wieland wrote: > > 1) With the current implementation they will see better performance on > > the master and more aggressive vacuum (!), since they have less > > long-running queries now on the master and autovacuum can kick in and > > clean up with less delay than before. On the other hand their queries > > on the standby might fail and they will start thinking that this HS+SR > > feature is not as convincing as they thought it was... > I assumed they would set max_standby_delay = -1 and be happy. The admin in this situation might be happy until the first time the primary fails and a failover is forced, at which point there is an unbounded amount of recovery data to apply that was stuck waiting behind whatever long-running queries were active. I don't know if you've ever watched what happens to a pre-8.2 cold standby when you start it up with hundreds or thousands of backed up WAL files to process before the server can start, but it's not a fast process. I watched a production 8.1 standby get >4000 files behind once due to an archive_command bug, and it's not something I'd like to ever chew my nails off to again. If your goal was HA and you're trying to bring up the standby, the server is down the whole time that's going on. This is why no admin who prioritizes HA would consider 'max_standby_delay = -1' a reasonable setting, and those are the sort of users Joachim's example was discussing. Only takes one rogue query that runs for a long time to make the standby so far behind it's useless for HA purposes. And you also have to ask yourself "if recovery is halted while waiting for this query to run, how stale is the data on the standby getting?". That's true for any large setting for this parameter, but using -1 for the unlimited setting also gives the maximum possible potential for such staleness. 'max_standby_delay = -1' is really only a reasonable idea if you are absolutely certain all queries are going to be short, which we can't dismiss as an unfounded use case so it has value. I would expect you have to also combine it with a matching reasonable statement_timeout to enforce that expectation to make that situation safer. In any of the "offload batch queries to the failover standby" situations, it's unlikely an unlimited value for this setting will be practical. Perhaps you set max_standby_delay to some number of hours, to match your expected worst-case query run time and reduce the chance of cancellation. Not putting a limit on it at all is a situation no DBA with healthy paranoia is going to be happy with the potential downside of in a HA environment, given that both unbounded staleness and recovery time are then both possible. The potential of a failed long-running query is much less risky than either of those. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Robert Haas wrote: > I just read through the current documentation and it doesn't really > seem to explain very much about how HS decides which queries to kill. > Can someone try to flesh that out a bit? I believe it just launches on a mass killing spree once things like max_standby_delay expire. This I want to confirm via testing (can simulate with a mix of long and short running pgbench queries) and then intend to update the docs to clarify. > It also uses the term > "buffer cleanup lock", which doesn't seem to be used anywhere else in > the documentation (though it does appear in the source tree, including > README.HOT). > This loose end was already noted in my last docs update. I wrote an initial description, but Bruce and I decided to leave out until something more thorough could be put together. This is also on my docs cleanup list, will get to it somewhere along the beta timeline. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
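That mix can be approximated with two pgbench sessions pointed at the standby, one running the built-in select-only test and one looping a deliberately long custom query; the file name and durations are placeholders, and -n is required because pgbench cannot vacuum on a read-only standby:

    $ cat > long_select.sql <<'EOF'
    SELECT sum(abalance) FROM pgbench_accounts;
    EOF
    $ pgbench -n -S -c 8 -T 600 -h standby pgbench &
    $ pgbench -n -f long_select.sql -c 2 -T 600 -h standby pgbench

Watching which of the two session types gets cancelled once max_standby_delay expires is exactly the behavior the documentation update should spell out.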
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Bruce Momjian
Date:
Greg Smith wrote: > > I assumed they would set max_standby_delay = -1 and be happy. > > > > The admin in this situation might be happy until the first time the > primary fails and a failover is forced, at which point there is an > unbounded amount of recovery data to apply that was stuck waiting behind > whatever long-running queries were active. I don't know if you've ever > watched what happens to a pre-8.2 cold standby when you start it up with > hundreds or thousands of backed up WAL files to process before the > server can start, but it's not a fast process. I watched a production > 8.1 standby get >4000 files behind once due to an archive_command bug, > and it's not something I'd like to ever chew my nails off to again. If > your goal was HA and you're trying to bring up the standby, the server > is down the whole time that's going on. > > This is why no admin who prioritizes HA would consider > 'max_standby_delay = -1' a reasonable setting, and those are the sort of > users Joachim's example was discussing. Only takes one rogue query that > runs for a long time to make the standby so far behind it's useless for > HA purposes. And you also have to ask yourself "if recovery is halted > while waiting for this query to run, how stale is the data on the > standby getting?". That's true for any large setting for this > parameter, but using -1 for the unlimited setting also gives the maximum > possible potential for such staleness. > > 'max_standby_delay = -1' is really only a reasonable idea if you are > absolutely certain all queries are going to be short, which we can't > dismiss as an unfounded use case so it has value. I would expect you > have to also combine it with a matching reasonable statement_timeout to > enforce that expectation to make that situation safer. Well, as you stated in your blog, you are going to have one of these downsides: o master bloat o delayed recovery o cancelled queries Right now you can't choose "master bloat", but you can choose the other two. I think that is acceptable for 9.0, assuming the other two don't have the problems that Tom foresees. Our documentation should probably just come out and state that clearly. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
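Spelled out in terms of the knobs discussed in this thread, and with purely illustrative values, those three choices map onto settings roughly like this:

    # standby: accept delayed recovery rather than cancellations
    max_standby_delay = -1            # or some large number of seconds

    # standby: accept cancelled queries and keep replay current
    max_standby_delay = 30

    # master: accept master bloat -- crude, measured in XIDs rather than time
    vacuum_defer_cleanup_age = 100000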
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Bruce Momjian
Date:
Bruce Momjian wrote: > > 'max_standby_delay = -1' is really only a reasonable idea if you are > > absolutely certain all queries are going to be short, which we can't > > dismiss as an unfounded use case so it has value. I would expect you > > have to also combine it with a matching reasonable statement_timeout to > > enforce that expectation to make that situation safer. > > Well, as you stated in your blog, you are going to have one of these > downsides: > > o master bloat > o delayed recovery > o cancelled queries > > Right now you can't choose "master bloat", but you can choose the other > two. I think that is acceptable for 9.0, assuming the other two don't > have the problems that Tom foresees. I was wrong. You can choose "master bloat" with vacuum_defer_cleanup_age, but only crudely because it is measured in xids and the master defers no matter what queries are running on the slave, and there is still the possibility for query cancel for long queries. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
On 3/2/10 10:30 AM, Bruce Momjian wrote: > Right now you can't choose "master bloat", but you can choose the other > two. I think that is acceptable for 9.0, assuming the other two don't > have the problems that Tom foresees. Actually, if vacuum_defer_cleanup_age can be used, "master bloat" is an option. Hopefully I'll get some time for serious testing this weekend. --Josh Berkus
Bruce Momjian wrote: > > Right now you can't choose "master bloat", but you can choose the other > > two. I think that is acceptable for 9.0, assuming the other two don't > > have the problems that Tom foresees. > I was wrong. You can choose "master bloat" with vacuum_defer_cleanup_age, but only crudely because it is measured in xids and the master defers no matter what queries are running on the slave... OK with you finding the situation acceptable, so long as it's an informed decision. From how you're writing about this, I'm comfortable you (and everybody else still involved here) have absorbed the issues enough that we're all talking about the same thing now. Since there are a couple of ugly user-space hacks possible for prioritizing "master bloat", and nobody is stepping up to work on resolving this via my suggestion involving better SR integration, seems to me heated discussion of code changes has come to a resolution of sorts I (and Simon, just checked) can live with. Sounds like we have three action paths here: -Tom already said he was planning a tour through the HS/SR code, I wanted that to happen with him aware of this issue. -Josh will continue doing his testing, also better informed about this particular soft spot. -I'll continue test-case construction for the problems here that there are still concerns about (pathologic max_standby_delay and b-tree split issues being the top two on that list), and keep sharing particularly interesting ones here to help everyone else's testing. If it turns out any of those paths leads to a must-fix problem that doesn't have an acceptable solution, at least the idea of this as a "plan B" is both documented and more widely understood than when I started ringing this particular bell. I just updated the Open Items list: http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items to officially put myself on the hook for the following HS related documentation items that have come up recently, aiming to get them all wrapped up in time before or during early beta: -Update Hot Standby documentation: clearly explain relationships between the 3 major setup trade-offs, "buffer cleanup lock", notes on which queries are killed once max_standby_delay is reached, measuring XID churn on master for setting vacuum_defer_cleanup_age -Clean up archive_command docs related to recent "/bin/true" addition.
Given that's where I expect people who run into the pg_stop_backup warning message recently added will end up at, noting its value for escaping from that particular case might be useful too. To finish airing my personal 9.0 TODO list now that I've gone this far, I'm also still working on completing the following patches that initial versions have been submitted of, was close to finishing both before getting side-tracked onto this larger issue: -pgbench > 4000 scale bug fix: http://archives.postgresql.org/message-id/4B621BA3.7090306@2ndquadrant.com -Improving the logging/error reporting/no timestamp issues in pg_standby re-raised recently by Selena: http://archives.postgresql.org/message-id/2b5e566d1001250945oae17be8n6317f827e3bd7492@mail.gmail.com If nobody else claims them as something they're working on before, I suspect I'll then move onto building some of the archiver UI improvements discussed most recently as part of the "pg_stop_backup does not complete" thread, despite Heikki having crushed my dreams of a simple solution to those by pointing out the shared memory limitation involved. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Mon, 2010-03-01 at 12:04 -0800, Josh Berkus wrote: > does anyone dispute his analysis? Simon? No dispute. I think I've discussed this before. -- Simon Riggs www.2ndQuadrant.com
On Mon, 2010-03-01 at 14:43 -0500, Tom Lane wrote: > Speaking of which, does the current HS+SR code have a > provision to force the standby to stop tracking WAL and come up live, > even when there's more WAL available? Yes, trigger file. -- Simon Riggs www.2ndQuadrant.com
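For reference, that is just a recovery.conf setting on the standby; a minimal sketch, with the path and connection string as examples only:

    # recovery.conf on the standby
    standby_mode = 'on'
    primary_conninfo = 'host=master port=5432'
    trigger_file = '/var/lib/pgsql/failover.trigger'

Creating the trigger file (for example with touch) makes the standby stop applying WAL and come up live even if more WAL is still available, which is the behavior Tom was asking about.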
On Sun, 2010-02-28 at 16:56 +0100, Joachim Wieland wrote: > Now let's take a look at both scenarios from the administrators' point > of view: Well argued, agree with all of your points. -- Simon Riggs www.2ndQuadrant.com
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Bruce Momjian
Date:
Greg Stark wrote: > On Mon, Mar 1, 2010 at 5:50 PM, Josh Berkus <josh@agliodbs.com> wrote: > > I don't think that defer_cleanup_age is a long-term solution. ?But we > > need *a* solution which does not involve delaying 9.0. > > So I think the primary solution currently is to raise max_standby_age. > > However there is a concern with max_standby_age. If you set it to, > say, 300s. Then run a 300s query on the slave which causes the slave > to fall 299s behind. Now you start a new query on the slave -- it gets > a snapshot based on the point in time that the slave is currently at. > If it hits a conflict it will only have 1s to finish before the > conflict causes the query to be cancelled. > > In short in the current setup I think there is no safe value of > max_standby_age which will prevent query cancellations short of -1. If > the slave has a constant stream of queries and always has at least one > concurrent query running then it's possible that the slave will run > continuously max_standby_age-epsilon behind the master and cancel > queries left and right, regardless of how large max_standby_age is. > > To resolve this I think you would have to introduce some chance for > the slave to catch up. Something like refusing to use a snapshot older > than max_standby_age/2 and instead wait until the existing queries > finish and the slave gets a chance to catch up and see a more recent > snapshot. The problem is that this would result in very unpredictable > and variable response times from the slave. A single long-lived query > could cause replay to pause for a big chunk of max_standby_age and > prevent any new query from starting. That is a good point. I have added the attached documentation patch to mention that max_standby_delay increases the master/slave inconsistency, and not to use it for xid-keepalive connections. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do Index: doc/src/sgml/config.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/config.sgml,v retrieving revision 1.256 diff -c -c -r1.256 config.sgml *** doc/src/sgml/config.sgml 27 Feb 2010 14:46:05 -0000 1.256 --- doc/src/sgml/config.sgml 2 Mar 2010 21:03:14 -0000 *************** *** 1869,1875 **** this parameter makes sense only during replication, so when performing an archive recovery to recover from data loss a very high parameter setting or -1 which means wait forever is recommended. ! The default is 30 seconds. This parameter can only be set in the <filename>postgresql.conf</> file or on the server command line. </para> --- 1869,1876 ---- this parameter makes sense only during replication, so when performing an archive recovery to recover from data loss a very high parameter setting or -1 which means wait forever is recommended. ! The default is 30 seconds. Increasing this parameter can delay ! master server changes from appearing on the standby. This parameter can only be set in the <filename>postgresql.conf</> file or on the server command line. 
</para> Index: doc/src/sgml/high-availability.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/high-availability.sgml,v retrieving revision 1.52 diff -c -c -r1.52 high-availability.sgml *** doc/src/sgml/high-availability.sgml 27 Feb 2010 09:29:20 -0000 1.52 --- doc/src/sgml/high-availability.sgml 2 Mar 2010 21:03:14 -0000 *************** *** 1410,1416 **** that the primary and standby nodes are linked via the WAL, so the cleanup situation is no different from the case where the query ran on the primary node itself. And you are still getting the benefit of off-loading the ! execution onto the standby. </para> <para> --- 1410,1418 ---- that the primary and standby nodes are linked via the WAL, so the cleanup situation is no different from the case where the query ran on the primary node itself. And you are still getting the benefit of off-loading the ! execution onto the standby. <varname>max_standby_delay</> should ! not be used in this case because delayed WAL files might already ! contain entries that invalidate the current snapshot. </para> <para>
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Bruce Momjian
Date:
Greg Stark wrote: > On Mon, Mar 1, 2010 at 5:50 PM, Josh Berkus <josh@agliodbs.com> wrote: > > I don't think that defer_cleanup_age is a long-term solution. ?But we > > need *a* solution which does not involve delaying 9.0. > > So I think the primary solution currently is to raise max_standby_age. > > However there is a concern with max_standby_age. If you set it to, > say, 300s. Then run a 300s query on the slave which causes the slave > to fall 299s behind. Now you start a new query on the slave -- it gets > a snapshot based on the point in time that the slave is currently at. > If it hits a conflict it will only have 1s to finish before the > conflict causes the query to be cancelled. > > In short in the current setup I think there is no safe value of > max_standby_age which will prevent query cancellations short of -1. If > the slave has a constant stream of queries and always has at least one > concurrent query running then it's possible that the slave will run > continuously max_standby_age-epsilon behind the master and cancel > queries left and right, regardless of how large max_standby_age is. This is sobering. I have added the attached documentation so at least this odd behavior is documented. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do Index: doc/src/sgml/config.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/config.sgml,v retrieving revision 1.257 diff -c -c -r1.257 config.sgml *** doc/src/sgml/config.sgml 2 Mar 2010 21:18:59 -0000 1.257 --- doc/src/sgml/config.sgml 2 Mar 2010 23:34:38 -0000 *************** *** 1862,1879 **** <listitem> <para> When server acts as a standby, this parameter specifies a wait policy ! for queries that conflict with data changes being replayed by recovery. If a conflict should occur the server will delay up to this number ! of seconds before it begins trying to resolve things less amicably, as ! described in <xref linkend="hot-standby-conflict">. Typically, ! this parameter makes sense only during replication, so when ! performing an archive recovery to recover from data loss a very high ! parameter setting or -1 which means wait forever is recommended. ! The default is 30 seconds. Increasing this parameter can delay ! master server changes from appearing on the standby. This parameter can only be set in the <filename>postgresql.conf</> file or on the server command line. </para> </listitem> </varlistentry> --- 1862,1892 ---- <listitem> <para> When server acts as a standby, this parameter specifies a wait policy ! for applying WAL entries that conflict with active queries. If a conflict should occur the server will delay up to this number ! of seconds before it cancels conflicting queries, as ! described in <xref linkend="hot-standby-conflict">. ! Typically, this parameter is used only during replication. ! The default is 30 seconds. This parameter can only be set in the <filename>postgresql.conf</> file or on the server command line. </para> + <para> + A high value makes query cancel less likely, and -1 + causes the standby to wait forever for a conflicting query to + complete. Increasing this parameter might delay master server + changes from appearing on the standby. + </para> + <para> + While it is tempting to believe that <varname>max_standby_delay</> + is the maximum number of seconds a query can run before + cancellation is possible, this is not true. 
When a long-running + query ends, there is a finite time required to apply backlogged + WAL logs. If a second long-running query appears before the + WAL has caught up, the snapshot taken by the second query will + allow significantly less than <varname>max_standby_delay</> + before query cancellation is possible. + </para> </listitem> </varlistentry>
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Bruce Momjian
Date:
Greg Smith wrote: > Bruce Momjian wrote: > >> Right now you can't choose "master bloat", but you can choose the other > >> two. I think that is acceptable for 9.0, assuming the other two don't > >> have the problems that Tom foresees. > >> > > > > I was wrong. You can choose "master bloat" with > > vacuum_defer_cleanup_age, but only crudely because it is measured in > > xids and the master defers no matter what queries are running on the > > slave... > > OK with you finding the situation acceptable, so long as it's an > informed decision. From how you're writing about this, I'm comfortable Well, consider that in November we were not even sure SR or HS would be in 9.0. We got them both, so if it is a little kludgy in 9.0, that's OK. We are much farther along than I ever expected. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
All, I've been playing with vacuum_defer_cleanup_age in reference to the query cancel problem. It really seems to me that this is the way forward in terms of dealing with query cancel for normal operation rather than wal_standby_delay, or maybe in combination with it. As a first test, I set up a deliberately pathological situation with pgbench and a wal_standby_delay of 1 second. This allowed me to trigger query cancel on a relatively simple reporting query; in fact, to make it impossible to complete. Then I increased vacuum_defer_cleanup_age to 100000, which represents about 5 minutes of transactions on the test system. This eliminated all query cancels for the reporting query, which takes an average of 10s. Next is a database bloat test, but I'll need to do that on a system with more free space than my laptop. --Josh Berkus
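In configuration terms that test sounds roughly like the following -- a reconstruction from the description above rather than the actual script, using the max_standby_delay GUC discussed earlier in the thread:

    # standby postgresql.conf -- deliberately pathological
    max_standby_delay = 1                  # cancel conflicting queries after ~1 second

    # master postgresql.conf
    vacuum_defer_cleanup_age = 100000      # roughly 5 minutes of XIDs on this test system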
On Wed, Mar 10, 2010 at 6:29 AM, Josh Berkus <josh@agliodbs.com> wrote: > Then I increased vacuum_defer_cleanup_age to 100000, which represents > about 5 minutes of transactions on the test system. This eliminated all > query cancels for the reporting query, which takes an average of 10s. > > Next is a database bloat test, but I'll need to do that on a system with > more free space than my laptop. Note that this will be heavily dependent on the use case. If you have one of those counter records that keeps being updated and gets cleaned up by HOT whenever the page fills up then you need to allow HOT to clean it up before it overflows the page or else it'll bloat the table and require a real vacuum. I think that means that a vacuum_defer_cleanup of up to about 100 or so (it depends on the width of your counter record) might be reasonable as a general suggestion but anything higher will depend on understanding the specific system. Another use case that might surprise people who are accustomed to the current behaviour is massive updates. This is the main really pessimal use case left in Postgres -- ideally they wouldn't bloat the table at all but currently they double the size of the table. People may be accustomed to the idea that they can then run vacuum and that will limit the bloat to 50%, assuming they have no (other) long-lived transactions. With vacuum_defer_cleanup that will no longer be true. It will be as if you always have a query lasting n transactions in your system at all times. -- greg
On 3/10/10 3:38 AM, Greg Stark wrote: > I think that means that a > vacuum_defer_cleanup of up to about 100 or so (it depends on the width > of your counter record) might be reasonable as a general suggestion > but anything higher will depend on understanding the specific system. 100 wouldn't be useful at all. It would increase bloat without doing anything about query cancel except on a very lightly used system. > With vacuum_defer_cleanup that will no longer be true. > It will be as if you always have a query lasting n transactions in > your system at all times. Yep, but until we get XID-publish-to-master working in 9.1, I think it's probably the best we can do. At least it's no *worse* than having a long-running query on the master at all times. --Josh Berkus
On Wed, Mar 10, 2010 at 3:29 PM, Josh Berkus <josh@agliodbs.com> wrote: > I've been playing with vacuum_defer_cleanup_age in reference to the > query cancel problem. It really seems to me that this is the way > forward in terms of dealing with query cancel for normal operation > rather than wal_standby_delay, or maybe in combination with it. Why isn't vacuum_defer_cleanup_age listed on postgresql.conf.sample? Though I also tried to test the effect of it, I was unable to find it in the conf file. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Re: Hot Standby query cancellation and Streaming Replication integration
From
Bruce Momjian
Date:
Fujii Masao wrote: > On Wed, Mar 10, 2010 at 3:29 PM, Josh Berkus <josh@agliodbs.com> wrote: > > I've been playing with vacuum_defer_cleanup_age in reference to the > > query cancel problem. ?It really seems to me that this is the way > > forward in terms of dealing with query cancel for normal operation > > rather than wal_standby_delay, or maybe in combination with it. > > Why isn't vacuum_defer_cleanup_age listed on postgresql.conf.sample? > Though I also tried to test the effect of it, I was unable to find it > in the conf file. I asked about that last week and for some reason Simon didn't want it added and Greg Smith was going to get this corrected. Greg? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
> Why isn't vacuum_defer_cleanup_age listed on postgresql.conf.sample? > Though I also tried to test the effect of it, I was unable to find it > in the conf file. Using it has some bugs we need to clean up, apparently. --Josh Berkus