Thread: Summary and Plan for Hot Standby
After some time thinking about the best way forward for Hot Standby, I have some observations and proposals. First, the project is very large. We have agreed ways to trim the patch, yet it remains large. Trying to do everything in one lump is almost always a bad plan, so we need to phase things. Second, everybody is keen that HS hits the tree, so we can have alpha code etc.. There are a few remaining issues that should *not* be rushed. The only way to remove this dependency is to decouple parts of the project. Third, testing the patch is difficult and continuous change makes it harder to guarantee everything is working. There are two remaining areas of significant thought/effort: * Issues relating to handling of prepared transactions * How fast Hot Standby mode is enabled in the standby I propose that we stabilise and eventually commit a version of HS that circumvents/defers those issues and then address the issues with separate patches afterwards. This approach will allow us to isolate the areas of further change so we can have a test blitz to remove silly mistakes, then follow it with a commit to CVS, and then release as Alpha to allow further testing. Let's look at the two areas of difficulty in more detail * Issues relating to handling of prepared transactions There are some delicate issues surrounding what happens at the end of recovery if there is a prepared transaction still holding an access exclusive lock. It is straightforward to say, as an interim measure, "Hot Standby will not work with max_prepared_transactions > 0". I see that this has a fiddly, yet fairly clear solution. * How fast Hot Standby mode is enabled in the standby We need to have full snapshot information on the standby before we can allow connections and queries. There are two basic approaches: i) we wait until we *know* we have full info or ii) we try to collect data and inject a correct starting condition. Waiting (i) may take a while, but is clean and requires only a few lines of code. Injecting the starting condition (ii) requires boatloads of hectic code and we have been unable to agree a way forwards. If we did have that code, all it would give us is a faster/more reliable starting point for connections on the standby. Until we can make approach (ii) work, we should just rely on the easy approach (i). In many cases, the starting point is very similar. (In some cases we can actually make (i) faster because the overhead of data collection forces us to derive the starting conditions minutes apart.) Phasing the commit seems like the only way. Please can we agree a way forwards? -- Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 09:06, Simon Riggs <simon@2ndquadrant.com> wrote: > * Issues relating to handling of prepared transactions > There are some delicate issues surrounding what happens at the end of > recovery if there is a prepared transaction still holding an access > exclusive lock. It is straightforward to say, as an interim measure, > "Hot Standby will not work with max_prepared_transactions > 0". I see > that this has a fiddly, yet fairly clear solution. If that restriction will solve the issues we have now, I find it a perfectly reasonable restriction, even if it were to still be there past release and only get fixed in a future release. The vast majority of our users don't use 2PC at all. Most cases where people had it enabled used to be because it was enabled by default, and the large majority of cases where I've seen people increase it has actually been because they thought it meant prepared statements, not prepared transactions. So definitely +1. > * How fast Hot Standby mode is enabled in the standby > We need to have full snapshot information on the standby before we can > allow connections and queries. There are two basic approaches: i) we > wait until we *know* we have full info or ii) we try to collect data and > inject a correct starting condition. Waiting (i) may take a while, but > is clean and requires only a few lines of code. Injecting the starting > condition (ii) requires boatloads of hectic code and we have been unable > to agree a way forwards. If we did have that code, all it would give us > is a faster/more reliable starting point for connections on the standby. > Until we can make approach (ii) work, we should just rely on the easy > approach (i). In many cases, the starting point is very similar. (In > some cases we can actually make (i) faster because the overhead of data > collection forces us to derive the starting conditions minutes apart.) That also seems perfectly reasonable, depending on how long the waiting on (i) will be :-) What does the time depend on? > Phasing the commit seems like the only way. Yeah, we usually recommend that in other cases, so I don't see why it shouldn't apply to HS. Seems like a good way forward. -- Magnus Hagander  Me: http://www.hagander.net/  Work: http://www.redpill-linpro.com/
On Sun, 2009-11-15 at 10:00 +0100, Magnus Hagander wrote: > What does the time depend on? We need to wait for all current transactions to complete. (i.e. any backend that has (or could) take an xid or an AccessExclusiveLock before it commits.). Similar-ish to the wait for a CREATE INDEX CONCURRENTLY. The standby already performs this wait in the case where we overflow the snapshot, so we have >64 subtransactions on *any* current transaction on the master. The reason for that is (again) performance on master: we choose not to WAL log new subtransactions. There are various ways around this and I'm certain we'll come up with something ingenious but my main point is that we don't need to wait for this issue to be solved in order for HS to be usable. -- Simon Riggs www.2ndQuadrant.com
On Sunday, November 15, 2009, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sun, 2009-11-15 at 10:00 +0100, Magnus Hagander wrote: > >> What does the time depend on? > > We need to wait for all current transactions to complete. (i.e. any > backend that has (or could) take an xid or an AccessExclusiveLock before > it commits.). Similar-ish to the wait for a CREATE INDEX CONCURRENTLY. > > The standby already performs this wait in the case where we overflow the > snapshot, so we have >64 subtransactions on *any* current transaction on > the master. The reason for that is (again) performance on master: we > choose not to WAL log new subtransactions. > > There are various ways around this and I'm certain we'll come up with > something ingenious but my main point is that we don't need to wait for > this issue to be solved in order for HS to be usable. Yeah, with that explanation (thanks for clearing it up) I agree - it will definitely still be hugely useful even with this restriction, so we really don't need to delay an initial (or the alpha at least) commit. Thus, +1 on the second one as well :) -- Magnus Hagander  Me: http://www.hagander.net/  Work: http://www.redpill-linpro.com/
Simon Riggs wrote: > We need to wait for all current transactions to complete. (i.e. any > backend that has (or could) take an xid or an AccessExclusiveLock before > it commits.). Similar-ish to the wait for a CREATE INDEX CONCURRENTLY. > > The standby already performs this wait in the case where we overflow the > snapshot, so we have >64 subtransactions on *any* current transaction on > the master. The reason for that is (again) performance on master: we > choose not to WAL log new subtransactions. WAL-logging every new subtransaction wouldn't actually help. The problem with subtransactions is that if the subxid cache overflows in the proc array in the master, the information about the parent-child relationship is only stored in pg_subtrans, not in the proc array. So when we take the running-xacts snapshot, we can't include that information, because there's no easy and fast way to scan pg_subtrans for it. Because that information is not included in the snapshot, the standby doesn't have all the information it needs until after it has seen that all the transactions that had an overflowed xid cache have finished. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
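A small self-contained model of the overflow mechanics described above (the constant, struct, and function names here are invented for illustration and are not the actual PostgreSQL symbols): each backend caches only a bounded number of subtransaction xids, and once a backend overflows, the parent-child mapping lives only in pg_subtrans, so a running-xacts snapshot taken at that moment cannot be complete and the standby has to wait those transactions out.

/* subxid_overflow_model.c -- illustrative model only, not PostgreSQL source */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define SUBXID_CACHE_SIZE 64            /* per-backend subxid cache, as discussed */

typedef struct BackendXids
{
    TransactionId xid;                        /* top-level xid */
    int           nsubxids;                   /* number of cached subxids */
    TransactionId subxids[SUBXID_CACHE_SIZE];
    bool          overflowed;                 /* some subxids live only in pg_subtrans */
} BackendXids;

/* Record a new subtransaction; once the cache is full, all we can do is
 * remember that it overflowed (the parent link is then only in pg_subtrans). */
static void
add_subxid(BackendXids *b, TransactionId subxid)
{
    if (b->nsubxids < SUBXID_CACHE_SIZE)
        b->subxids[b->nsubxids++] = subxid;
    else
        b->overflowed = true;
}

/* A running-xacts snapshot built from these caches is complete only if no
 * backend has overflowed; otherwise the standby must wait until every
 * overflowed transaction has finished before enabling connections. */
static bool
snapshot_is_complete(const BackendXids *backends, int nbackends)
{
    for (int i = 0; i < nbackends; i++)
        if (backends[i].overflowed)
            return false;
    return true;
}

int
main(void)
{
    BackendXids b = {.xid = 100};
    for (TransactionId x = 101; x <= 200; x++)   /* 100 subxacts: the cache overflows */
        add_subxid(&b, x);
    return snapshot_is_complete(&b, 1) ? 0 : 1;  /* exits 1: snapshot is incomplete */
}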
Simon Riggs wrote: > * Issues relating to handling of prepared transactions > There are some delicate issues surrounding what happens at the end of > recovery if there is a prepared transaction still holding an access > exclusive lock. Can you describe in more detail what problem this is again? We had various problems with prepared transactions, but I believe what's in the git repository now handles all those cases (although I just noticed and fixed a bug in it, so it's not very well tested or reviewed yet). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, Nov 15, 2009 at 3:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Please can we agree a way forwards? I don't have a strong position on the technical issues, but I am very much in favor of getting something committed, even something with limitations, as soon as we practically can. Getting this feature into the tree will get a lot more eyeballs on it, and it's much better to do that now, while we still have several months remaining before beta, so those eyeballs can be looking at it for longer - and testing it as part of the regular alpha release process. It will also eliminate the need to repeatedly merge with the main tree, etc. ...Robert
Simon Riggs wrote: > There are two remaining areas of significant thought/effort: Here's a list of other TODO items I've collected so far. Some of them are just improvements or nice-to-have stuff, but some are more serious: - If WAL recovery runs out of lock space while acquiring an AccessExclusiveLock on behalf of a transaction that ran in the master, it will FATAL and abort recovery, bringing down the standby. Seems like it should wait/cancel queries instead. - When switching from standby mode to normal operation, we momentarily hold all AccessExclusiveLocks held by prepared xacts twice, needing twice the lock space. You can run out of lock space at that point, causing failover to fail. - When replaying b-tree deletions, we currently wait out/cancel all running (read-only) transactions. We take the ultra-conservative stance because we don't know how recent the tuples being deleted are. If we could store a better estimate for latestRemovedXid in the WAL record, we could make that less conservative. - The assumption that b-tree vacuum records don't need conflict resolution because we did that with the additional cleanup-info record works ATM, but it hinges on the fact that we don't delete any tuples marked as killed while we do the vacuum. That seems like a low-hanging fruit that I'd actually like to do now that I spotted it, but will then need to fix b-tree vacuum records accordingly. We'd probably need to do something about the previous item first to keep performance acceptable. - There's the optimization to replay of b-tree vacuum records that we discussed earlier: Replay has to touch all leaf pages because of the interlock between heap scans, to ensure that we don't vacuum away a heap tuple that a concurrent index scan is about to visit. Instead of actually reading in and pinning all pages, during replay we could just check that the pages that don't need any other work to be done are not currently pinned in the buffer cache. - Do we do the b-tree page pinning explained in previous point correctly at the end of index vacuum? ISTM we're not visiting any pages after the last page that had dead tuples on it. - code structure. I moved much of the added code to a new standby.c module that now takes care of replaying standby related WAL records. But there's code elsewhere too. I'm not sure if this is a good division but seems better than the original ad hoc arrangement where e.g lock-related WAL handling was in inval.c - The "standby delay" is measured as current timestamp - timestamp of last replayed commit record. If there's little activity in the master, that can lead to surprising results. For example, imagine that max_standby_delay is set to 8 hours. The standby is fully up-to-date with the master, and there's no write activity in master. After 10 hours, a long reporting query is started in the standby. Ten minutes later, a small transaction is executed in the master that conflicts with the reporting query. I would expect the reporting query to be canceled 8 hours after the conflicting transaction began, but it is in fact canceled immediately, because it's over 8 hours since the last commit record was replayed. - ResolveRecoveryConflictWithVirtualXIDs polls until the victim transactions have ended. It would be much nicer to sleep. We'd need a version of LockAcquire with a timeout. Hmm, IIRC someone submitted a patch for lock timeouts recently. Maybe we could borrow code from that? 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
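The "standby delay" item above is easiest to see with a tiny self-contained model (plain time_t and invented names, not the patch's code): the delay is measured from the last replayed commit record, so on an idle master the whole budget can already be spent when the first conflicting record arrives.

/* standby_delay_model.c -- illustrative model only */
#include <stdbool.h>
#include <time.h>

/* The rule described above: cancel conflicting standby queries once
 * (now - timestamp of last replayed commit record) exceeds max_standby_delay. */
static bool
must_cancel_conflicting_queries(time_t now,
                                time_t last_replayed_commit,
                                int max_standby_delay_secs)
{
    return difftime(now, last_replayed_commit) > max_standby_delay_secs;
}

/* The scenario above: master idle for 10 hours, max_standby_delay = 8 hours.
 * The first conflict cancels the reporting query immediately, because the last
 * commit record was replayed 10 hours ago, not because the query has actually
 * been holding up recovery for 8 hours. */
int
main(void)
{
    time_t last_commit = 0;             /* last commit record replayed at t = 0 */
    time_t conflict_at = 10 * 3600;     /* conflicting record arrives 10 hours later */
    return must_cancel_conflicting_queries(conflict_at, last_commit, 8 * 3600) ? 0 : 1;
}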
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > There are two remaining areas of significant thought/effort: > > Here's a list of other TODO items I've collected so far. Some of them > are just improvements or nice-to-have stuff, but some are more serious: > > - If WAL recovery runs out of lock space while acquiring an > AccessExclusiveLock on behalf of a transaction that ran in the master, > it will FATAL and abort recovery, bringing down the standby. Seems like > it should wait/cancel queries instead. Hard resources will always be an issue. If the standby has less than it needs, then there will be problems. All of those can be corrected by increasing the resources on the standby and restarting. This affects max_connections, max_prepared_transactions, max_locks_per_transaction, as documented. > - When switching from standby mode to normal operation, we momentarily > hold all AccessExclusiveLocks held by prepared xacts twice, needing > twice the lock space. You can run out of lock space at that point, > causing failover to fail. That was the issue I mentioned. > - When replaying b-tree deletions, we currently wait out/cancel all > running (read-only) transactions. We take the ultra-conservative stance > because we don't know how recent the tuples being deleted are. If we > could store a better estimate for latestRemovedXid in the WAL record, we > could make that less conservative. Exactly my point. There are already parts of the patch that may cause usage problems and need further thought. The earlier we get this to people the earlier we will find out what they all are and begin doing something about them. > - The assumption that b-tree vacuum records don't need conflict > resolution because we did that with the additional cleanup-info record > works ATM, but it hinges on the fact that we don't delete any tuples > marked as killed while we do the vacuum. That seems like a low-hanging > fruit that I'd actually like to do now that I spotted it, but will then > need to fix b-tree vacuum records accordingly. We'd probably need to do > something about the previous item first to keep performance acceptable. > > - There's the optimization to replay of b-tree vacuum records that we > discussed earlier: Replay has to touch all leaf pages because of the > interlock between heap scans, to ensure that we don't vacuum away a heap > tuple that a concurrent index scan is about to visit. Instead of > actually reading in and pinning all pages, during replay we could just > check that the pages that don't need any other work to be done are not > currently pinned in the buffer cache. Yes, it's an optimization. Not one I consider critical, yet cool and interesting. > - Do we do the b-tree page pinning explained in previous point correctly > at the end of index vacuum? ISTM we're not visiting any pages after the > last page that had dead tuples on it. Looks like a new bug, not previously mentioned. > - code structure. I moved much of the added code to a new standby.c > module that now takes care of replaying standby related WAL records. But > there's code elsewhere too. I'm not sure if this is a good division but > seems better than the original ad hoc arrangement where e.g lock-related > WAL handling was in inval.c > - The "standby delay" is measured as current timestamp - timestamp of > last replayed commit record. If there's little activity in the master, > that can lead to surprising results. For example, imagine that > max_standby_delay is set to 8 hours. 
The standby is fully up-to-date > with the master, and there's no write activity in master. After 10 > hours, a long reporting query is started in the standby. Ten minutes > later, a small transaction is executed in the master that conflicts with > the reporting query. I would expect the reporting query to be canceled 8 > hours after the conflicting transaction began, but it is in fact > canceled immediately, because it's over 8 hours since the last commit > record was replayed. An issue that will be easily fixable with streaming, since it effectively needs a heartbeat to listen to. Adding a regular stream of WAL records is also possible, but there is no need, unless streaming is somehow in doubt. Again, there is work to do once both are in. > - ResolveRecoveryConflictWithVirtualXIDs polls until the victim > transactions have ended. It would be much nicer to sleep. We'd need a > version of LockAcquire with a timeout. Hmm, IIRC someone submitted a > patch for lock timeouts recently. Maybe we could borrow code from that? Nice? -- Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 2:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> - The "standby delay" is measured as current timestamp - timestamp of >> last replayed commit record. If there's little activity in the master, >> that can lead to surprising results. For example, imagine that >> max_standby_delay is set to 8 hours. The standby is fully up-to-date >> with the master, and there's no write activity in master. After 10 >> hours, a long reporting query is started in the standby. Ten minutes >> later, a small transaction is executed in the master that conflicts with >> the reporting query. I would expect the reporting query to be canceled 8 >> hours after the conflicting transaction began, but it is in fact >> canceled immediately, because it's over 8 hours since the last commit >> record was replayed. > > An issue that will be easily fixable with streaming, since it > effectively needs a heartbeat to listen to. Adding a regular stream of > WAL records is also possible, but there is no need, unless streaming is > somehow in doubt. Again, there is work to do once both are in. I don't think you need a heartbeat to solve this particular case. You just need to define the "standby delay" to be "current timestamp - timestamp of the conflicting candidate commit record". -- greg
Simon Riggs wrote: > On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: >> - If WAL recovery runs out of lock space while acquiring an >> AccessExclusiveLock on behalf of a transaction that ran in the master, >> it will FATAL and abort recovery, bringing down the standby. Seems like >> it should wait/cancel queries instead. > > Hard resources will always be an issue. If the standby has less than it > needs, then there will be problems. All of those can be corrected by > increasing the resources on the standby and restarting. This effects > max_connections, max_prepared_transactions, max_locks_per_transaction, > as documented. There's no safe setting for those that would let you avoid the issue. No matter how high you set them, it will be possible for read-only backends to consume all the lock space, causing recovery to abort and bring down the standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 14:47 +0000, Greg Stark wrote: > On Sun, Nov 15, 2009 at 2:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> - The "standby delay" is measured as current timestamp - timestamp of > >> last replayed commit record. If there's little activity in the master, > >> that can lead to surprising results. For example, imagine that > >> max_standby_delay is set to 8 hours. The standby is fully up-to-date > >> with the master, and there's no write activity in master. After 10 > >> hours, a long reporting query is started in the standby. Ten minutes > >> later, a small transaction is executed in the master that conflicts with > >> the reporting query. I would expect the reporting query to be canceled 8 > >> hours after the conflicting transaction began, but it is in fact > >> canceled immediately, because it's over 8 hours since the last commit > >> record was replayed. > > > > An issue that will be easily fixable with streaming, since it > > effectively needs a heartbeat to listen to. Adding a regular stream of > > WAL records is also possible, but there is no need, unless streaming is > > somehow in doubt. Again, there is work to do once both are in. > > I don't think you need a heartbeat to solve this particular case. You > just need to define the "standby delay" to be "current timestamp - > timestamp of the conflicting candidate commit record". That's not possible, unfortunately. We only have times for commits and aborts. So there could be untimed WAL records ahead of the last timed record. The times of events we know from the log records give us no clue as to when the last non-commit/abort record arrived. We can only do that by (i) specifically augmenting the log with regular, timed WAL records, or (ii) asking WALreceiver directly when it last spoke with the master. (ii) is the obvious way to do this when we have streaming replication, and HS assumes this will be available. It need not be, and we can do (i). Heikki's case is close to one I would expect to see in many cases: a database that is only active during the day feeds a system that runs queries 24x7. Run a VACUUM on the master at night and you could get conflicts that follow the pattern described. -- Simon Riggs www.2ndQuadrant.com
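A sketch of option (ii) above, again with invented names rather than real functions: once streaming replication can report when the walreceiver last heard from the master, that timestamp gives a baseline that keeps advancing even when no commit records are flowing, which avoids the surprise in the idle-master scenario.

/* Alternative delay rule, per option (ii): measure from the walreceiver's last
 * contact with the master rather than from the last replayed commit record.
 * Illustrative model only. */
#include <stdbool.h>
#include <time.h>

bool
must_cancel_conflicting_queries(time_t now,
                                time_t walreceiver_last_contact,
                                int max_standby_delay_secs)
{
    return difftime(now, walreceiver_last_contact) > max_standby_delay_secs;
}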
On Sun, 2009-11-15 at 16:50 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > >> - If WAL recovery runs out of lock space while acquiring an > >> AccessExclusiveLock on behalf of a transaction that ran in the master, > >> it will FATAL and abort recovery, bringing down the standby. Seems like > >> it should wait/cancel queries instead. > > > > Hard resources will always be an issue. If the standby has less than it > > needs, then there will be problems. All of those can be corrected by > > increasing the resources on the standby and restarting. This effects > > max_connections, max_prepared_transactions, max_locks_per_transaction, > > as documented. > > There's no safe setting for those that would let you avoid the issue. No > matter how high you set them, it will be possible for read-only backends > to consume all the lock space, causing recovery to abort and bring down > the standby. It can still fail even after we kick everybody off. So why bother? Most people run nowhere near the size limit of their lock tables, and on the standby we only track AccessExclusiveLocks in the Startup process. We gain little by spending time on partial protection against an unlikely issue. (BTW, I'm not suggesting you commit HS immediately. Only that we split into phases, stabilise and test phase 1 soon, then fix the remaining issues later.) -- Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 9:50 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Simon Riggs wrote: >> On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: >>> - If WAL recovery runs out of lock space while acquiring an >>> AccessExclusiveLock on behalf of a transaction that ran in the master, >>> it will FATAL and abort recovery, bringing down the standby. Seems like >>> it should wait/cancel queries instead. >> >> Hard resources will always be an issue. If the standby has less than it >> needs, then there will be problems. All of those can be corrected by >> increasing the resources on the standby and restarting. This effects >> max_connections, max_prepared_transactions, max_locks_per_transaction, >> as documented. > > There's no safe setting for those that would let you avoid the issue. No > matter how high you set them, it will be possible for read-only backends > to consume all the lock space, causing recovery to abort and bring down > the standby. OK, but... there will always be things that will bring down the stand-by, just as there will always be things that can bring down the primary. A bucket of ice-water will probably do it, for example. I mean, it would be great to make it better, but is it so bad that we can't postpone that improvement to a follow-on patch? It's not clear to me that it is. I think we should really focus in on things that are likely to make this (1) give wrong answers or (2) won't be able to be fixed in a follow-on patch if they're not right in the original one. Only one or two of the items on your list of additional TODOs seem like they might fall into category (2), and none of them appear to fall into category (1). I predict that if we commit a basic version of this with some annoying limitations for 8.5, people who need the feature will start writing patches to address some of the limitations. No one else is going to undertake any serious development work as long as this remains outside the main tree, for fear of everything changing under them and all their work being wasted. I would like this feature to be as good as possible, but I would like to have it at all more. ...Robert
Robert Haas wrote: > OK, but... there will always be things that will bring down the > stand-by, just as there will always be things that can bring down the > primary. A bucket of ice-water will probably do it, for example. I > mean, it would be great to make it better, but is it so bad that we > can't postpone that improvement to a follow-on patch? We're not talking about a bucket of ice-water. We're talking about issuing SELECTs to a lot of different tables in a single transaction. > Only one or two of the items on your list of additional TODOs > seem like they might fall into category (2), and none of them appear > to fall into category (1). At least the b-tree vacuum bug could cause incorrect answers, even though it would be extremely hard to run into it in practice. > I predict that if we commit a basic version of this with some annoying > limitations for 8.5, people who need the feature will start writing > patches to address some of the limitations. No one else is going to > undertake any serious development work as long as this remains outside > the main tree, for fear of everything changing under them and all > their work being wasted. I would like this feature to be as good as > possible, but I would like to have it at all more. Agreed. Believe me, I'd like to have this committed as much as everyone else. But once I do that, I'm also committing myself to fix all the remaining issues before the release. The criteria for committing is: is it good enough that we could release it tomorrow with no further changes? Nothing more, nothing less. We have *already* postponed a lot of nice-to-have stuff like the functions to control recovery. And yes, many of the things I listed in the TODO are not must-haves and we could well release without them. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > The assumption that b-tree vacuum records don't need conflict > resolution because we did that with the additional cleanup-info record > works ATM, but it hinges on the fact that we don't delete any tuples > marked as killed while we do the vacuum. > That seems like a low-hanging > fruit that I'd actually like to do now that I spotted it, but will > then need to fix b-tree vacuum records accordingly. We'd probably need > to do something about the previous item first to keep performance > acceptable. We can optimise that by using the xlog pointer of the HeapInfo record. Any blocks cleaned that haven't been further updated can avoid generating further btree deletion records. If you do this the straightforward way then it will just generate a stream of btree deletion records that will ruin usability. You spotted this issue only this morning?? -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > >> The assumption that b-tree vacuum records don't need conflict >> resolution because we did that with the additional cleanup-info record >> works ATM, but it hinges on the fact that we don't delete any tuples >> marked as killed while we do the vacuum. > >> That seems like a low-hanging >> fruit that I'd actually like to do now that I spotted it, but will >> then need to fix b-tree vacuum records accordingly. We'd probably need >> to do something about the previous item first to keep performance >> acceptable. > > We can optimise that by using the xlog pointer of the HeapInfo record. > Any blocks cleaned that haven't been further updated can avoid > generating further btree deletion records. Sorry, I don't understand that. (Remember that marking index tuples as killed is not WAL-logged.) > You spotted this issue only this morning?? Yesterday evening. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 19:36 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > > > >> The assumption that b-tree vacuum records don't need conflict > >> resolution because we did that with the additional cleanup-info record > >> works ATM, but it hinges on the fact that we don't delete any tuples > >> marked as killed while we do the vacuum. > > > >> That seems like a low-hanging > >> fruit that I'd actually like to do now that I spotted it, but will > >> then need to fix b-tree vacuum records accordingly. We'd probably need > >> to do something about the previous item first to keep performance > >> acceptable. > > > > We can optimise that by using the xlog pointer of the HeapInfo record. > > Any blocks cleaned that haven't been further updated can avoid > > generating further btree deletion records. > > Sorry, I don't understand that. (Remember that marking index tuples as > killed is not WAL-logged.) Remember that blocks are marked with an LSN? When we insert a WAL record it has an LSN also. So we can tell which btree blocks might have been written to after the HeapInfo record is generated. So if a block hasn't been recently updated or it doesn't have any killed tuples then we need not generate a record to handle a possible conflict. -- Simon Riggs www.2ndQuadrant.com
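A self-contained sketch of the test Simon has in mind (simplified types and invented names; Heikki's reply just below explains the catch, namely that merely marking itemids dead does not advance the page LSN):

/* lsn_skip_model.c -- illustrative model of the proposed check */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified: enough for an ordering test */

/* Skip generating a conflict-bearing deletion record for an index block that
 * has not been written to since the cleanup-info record was emitted and that
 * carries no killed tuples; otherwise a record is still needed. */
static bool
needs_conflict_record(XLogRecPtr page_lsn,
                      XLogRecPtr cleanup_info_lsn,
                      bool page_has_killed_items)
{
    if (page_lsn <= cleanup_info_lsn && !page_has_killed_items)
        return false;
    return true;
}

int
main(void)
{
    /* Block untouched since the cleanup-info record, with no killed tuples:
     * no extra conflict record would be needed for it. */
    return needs_conflict_record(100, 200, false) ? 1 : 0;
}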
On Sun, Nov 15, 2009 at 11:49 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Agreed. Believe me, I'd like to have this committed as much as everyone > else. But once I do that, I'm also committing myself to fix all the > remaining issues before the release. The criteria for committing is: is > it good enough that we could release it tomorrow with no further > changes? Nothing more, nothing less. I agree with the criteria but I think their application to the present set of facts is debatable. If the b-tree vacuum bug can cause incorrect answers, then it is a bug and we have to fix it. But a query getting canceled because it touches a lot of tables sounds more like a limitation than an outright bug, and I'm not sure you should feel like you're on the hook for that, especially if the problem can be mitigated by adjusting settings. Of course, on the flip side, if the problem is likely to occur frequently enough to make the whole system unusable in practice, then maybe it does need to be fixed. I don't know. It's not my place and I don't intend to question your technical judgment on what does or does not need to be fixed, the more so since I haven't read or thought deeply about the latest patch. I'm just throwing it out there. The other problem is that we have another big patch sitting right behind this one waiting for your attention as soon as you get this one off your chest. I know Simon has said that he feels that the effort to finish the HS and SR patches for 9/15 was somewhat of an artificial deadline, but ISTM that with only 3 months remaining until the close of the final CommitFest for this release, and two major patches to be merged, we're starting to get tight on time. Presumably there will be problems with both patches that are discovered only after committing them, and we need some time for those to shake out. If not enough of that shaking out happens during the regular development cycle, it will either prolong beta and therefore delay the release, or the release will be buggy. All that having been said, the possibility that I'm a pessimistic worrywart certainly can't be ruled out. :-) ...Robert
Simon Riggs wrote: > On Sun, 2009-11-15 at 19:36 +0200, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: >>> >>>> The assumption that b-tree vacuum records don't need conflict >>>> resolution because we did that with the additional cleanup-info record >>>> works ATM, but it hinges on the fact that we don't delete any tuples >>>> marked as killed while we do the vacuum. >>>> That seems like a low-hanging >>>> fruit that I'd actually like to do now that I spotted it, but will >>>> then need to fix b-tree vacuum records accordingly. We'd probably need >>>> to do something about the previous item first to keep performance >>>> acceptable. >>> We can optimise that by using the xlog pointer of the HeapInfo record. >>> Any blocks cleaned that haven't been further updated can avoid >>> generating further btree deletion records. >> Sorry, I don't understand that. (Remember that marking index tuples as >> killed is not WAL-logged.) > > Remember that blocks are marked with an LSN? When we insert a WAL record > it has an LSN also. So we can tell which btree blocks might have had > been written to after the HeapInfo record is generated. So if a block > hasn't been recently updated or it doesn't have any killed tuples then > we need not generate a record to handle a possible conflict. Hmm, perhaps we're talking about the same thing. What I'm seeing is that we could easily do this: *** a/src/backend/access/nbtree/nbtree.c --- b/src/backend/access/nbtree/nbtree.c *************** *** 843,855 **** restart: offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { IndexTuple itup; ItemPointer htup; ! itup = (IndexTuple) PageGetItem(page, ! PageGetItemId(page, offnum)); htup = &(itup->t_tid); ! if (callback(htup, callback_state)) deletable[ndeletable++] = offnum; } } --- 843,856 ---- offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { + ItemId itemid; IndexTuple itup; ItemPointer htup; ! itemid = PageGetItemId(page, offnum); ! itup = (IndexTuple) PageGetItem(page, itemid); htup = &(itup->t_tid); ! if (callback(htup, callback_state) || ItemIdIsDead(itemid)) deletable[ndeletable++]= offnum; } } But if we do that, b-tree vacuum records are going to need conflict resolution, just like the b-tree non-vacuum deletion records. The LSN doesn't help there, because when an itemid is marked as dead, the LSN is not updated. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Robert Haas wrote: > But a > query getting canceled because it touches a lot of tables sounds more > like a limitation than an outright bug, It's not that the query might get canceled. It will abort WAL recovery, kill all backends, and bring the whole standby down. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 13:15 -0500, Robert Haas wrote: > I know Simon has said that he feels that the effort > to finish the HS and SR patches for 9/15 was somewhat of an artificial > deadline, but ISTM that with only 3 months remaining until the close > of the final CommitFest for this release, and two major patches to > be merged, we're starting to get tight on time. Given the further concerns about initial snapshot conditions, I agree we are now tight on time. > Presumably there will be > problems with both patches that are discovered only after committing > them, and we need some time for those to shake out. If not enough of > that shaking out happens during the regular development cycle, it will > either prolong beta and therefore delay the release, or the release > will be buggy. I'm not worried about bugs. Fixes for those can go in anytime. Missing features and small usability enhancements will be forced to wait another year and cause upgrades for early adopters. That worries me. REL8_0 shipped with an unusable bgwriter implementation and I've always been wary of the need for minor tweaks late in a release since then. I've not asked for an immediate commit, but we do need an agreed period of patch stability to allow testing, prior to a commit. -- Simon Riggs www.2ndQuadrant.com
On Sun, 2009-11-15 at 20:30 +0200, Heikki Linnakangas wrote: > The LSN doesn't help there, because when an itemid is marked as dead, > the LSN is not updated. I was thinking we could update the index block LSN without writing WAL using the LSN of the heap block that leads to the killed tuple. Pretending that the block might need flushing won't do much harm. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Sun, 2009-11-15 at 20:30 +0200, Heikki Linnakangas wrote: > >> The LSN doesn't help there, because when an itemid is marked as dead, >> the LSN is not updated. > > I was thinking we could update the index block LSN without writing WAL > using the LSN of the heap block that leads to the killed tuple. That can be before the cleanup record we write before we start the index vacuum. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 20:37 +0200, Heikki Linnakangas wrote: > Robert Haas wrote: > > But a > > query getting canceled because it touches a lot of tables sounds more > > like a limitation than an outright bug, > > It's not that the query might get canceled. It will abort WAL recovery, > kill all backends, and bring the whole standby down. Hmm, I think the incredible exploding Hot Standby is overstating this somewhat. We can improve the error handling for this rare case for which a simple workaround exists, but it seems like we should punt to phase 2. You agree there should be two phases? -- Simon Riggs www.2ndQuadrant.com
On Sun, Nov 15, 2009 at 7:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > You agree there should be two phases? > I don't understand this repeated suggestion of "phases". Nobody's ever suggested that we would refuse to add new features to HS after the initial commit or the 8.5 release. Of course there should be later features if you or anyone else is interested in working on them. Or are you asking whether we should commit it before it's a usable subset of the functionality? Personally I am in favour of earlier, more fine-grained commits but I think the horse has left the stable on that one. We have a usable subset of the functionality in this patch already, don't we? -- greg
On Sun, 2009-11-15 at 21:20 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sun, 2009-11-15 at 20:30 +0200, Heikki Linnakangas wrote: > > > >> The LSN doesn't help there, because when an itemid is marked as dead, > >> the LSN is not updated. > > > > I was thinking we could update the index block LSN without writing WAL > > using the LSN of the heap block that leads to the killed tuple. > > That can be before the cleanup record we write before we start the index > vacuum. Oh well. Strike 1. But the technique sounds OK, we just need to get the LSN of a HeapInfo record from somewhere, say, index metapage. Sounds like we need to do something similar with the xid. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > You agree there should be two phases? I'm hesitant to say 'yes', because then you will harass me with "but you said that you would be OK with fixing X, Y, Z later! Why don't you commit already!". Of course there should be several phases! We've *already* punted a lot of stuff from this first increment we're currently working on. The criteria for getting this first phase committed is: could we release with no further changes? If you actually want to help, can you please focus on fixing the must-fix bugs we know about? We can then discuss which of the remaining known issues we're willing to live with. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Sun, 2009-11-15 at 21:20 +0200, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> On Sun, 2009-11-15 at 20:30 +0200, Heikki Linnakangas wrote: >>> >>>> The LSN doesn't help there, because when an itemid is marked as dead, >>>> the LSN is not updated. >>> I was thinking we could update the index block LSN without writing WAL >>> using the LSN of the heap block that leads to the killed tuple. >> That can be before the cleanup record we write before we start the index >> vacuum. > > Oh well. Strike 1. > > But the technique sounds OK, we just need to get the LSN of a HeapInfo > record from somewhere, say, index metapage. Sounds like we need to do > something similar with the xid. I'm thinking that we should address the general issue, not just with vacuum-related deletion records. For the vacuum-related deletion records, we can just leave the code as it is. I think we talked about various approaches about a year ago when we first realized that killed index tuples are a problem, though I don't think we carved out a full solution. We could, for example, store the xmax (or xmin if it was inserted by an aborted transaction) of the killed tuple in the b-tree page header whenever we mark an index tuple as dead. We could then include that in the WAL record. The trick is how to make that crash-safe. (but this whole thing is certainly something we can defer until later) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
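A minimal model of the idea above (invented names; real code would need wraparound-aware xid comparisons and would have to make the value crash-safe, which is exactly the open question): remember the newest xid removed on a page whenever an index tuple is marked killed, so a later deletion record can carry a precise conflict horizon instead of forcing the ultra-conservative cancel-everything behaviour.

/* killed_tuple_horizon_model.c -- illustrative model only */
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical per-page state: the newest xid whose removal made a tuple on
 * this page dead (0 means nothing has been killed here yet). */
typedef struct PageHorizon
{
    TransactionId latest_removed_xid;
} PageHorizon;

/* Called whenever an index tuple is marked killed: remember the xid that made
 * the heap tuple dead, so a later deletion WAL record can include it and the
 * standby can resolve conflicts against that xid rather than all queries. */
static void
note_killed_tuple(PageHorizon *ph, TransactionId tuple_xmax)
{
    if (tuple_xmax > ph->latest_removed_xid)   /* simplified: ignores xid wraparound */
        ph->latest_removed_xid = tuple_xmax;
}

int
main(void)
{
    PageHorizon ph = {0};
    note_killed_tuple(&ph, 4242);
    note_killed_tuple(&ph, 1000);
    return ph.latest_removed_xid == 4242 ? 0 : 1;
}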
On Sun, 2009-11-15 at 21:56 +0200, Heikki Linnakangas wrote: > If you actually want to help, can you please focus on fixing the > must-fix bugs we know about? We can then discuss which of the > remaining known issues we're willing to live with. I intend to work on all of the issues, so not sure what you mean by help. When the role of author and reviewer becomes blurred it gets harder to work together, for certain. Since we are short of time and some issues will take time, the priority order of further work is important. Right now, I don't know which you consider to be the must-fix issues, hence the thread. I also don't know what you consider to be appropriate fixes to them, so unfortunately there will be more talking until it is time for action. I prefer coding, just like you. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > Right now, I don't know which you > consider to be the must-fix issues, hence the thread. Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the index pages after the last b-tree vacuum record? Thanks. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > Right now, I don't know which you > > consider to be the must-fix issues, hence the thread. > > Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the > index pages after the last b-tree vacuum record? Thanks. That's all? You sure? -- Simon Riggs www.2ndQuadrant.com
On 11/15/09 12:58 PM, Simon Riggs wrote: > On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> Right now, I don't know which you >>> consider to be the must-fix issues, hence the thread. >> Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the >> index pages after the last b-tree vacuum record? Thanks. > > That's all? You sure? Just speaking from a user/tester perspective, a HS with known caveats and failure conditions would be acceptable in Alpha3. It would be better than waiting for Alpha4. Not only would getting some form of HS into Alpha3 get people testing HS and finding failure conditions we didn't think of earlier, it would also inspire people to compile and test the Alphas, period. Right now the whole Alpha testing program seems to have only attracted The Usual Contributors, despite efforts to publicize it. So I'm in favor of committing part of the HS code even if there are known failure conditions, as long as those conditions are well-defined. (and applause to Simon and Heikki for continuing to put noses to grindstones on this, and Robert for keeping an eye on the schedule) --Josh Berkus
Simon Riggs wrote: > On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> Right now, I don't know which you >>> consider to be the must-fix issues, hence the thread. >> Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the >> index pages after the last b-tree vacuum record? Thanks. > > That's all? You sure? For starters. If you think you'll get that done quickly, please take a look at the "bucket of ice-water" issue next. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 23:14 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote: > >> Simon Riggs wrote: > >>> Right now, I don't know which you > >>> consider to be the must-fix issues, hence the thread. > >> Ok, could you tackle the b-tree vacuum bug, where we neglect to pin the > >> index pages after the last b-tree vacuum record? Thanks. > > > > That's all? You sure? > > For starters. If you think you'll get that done quickly, please take a > look at the "bucket of ice-water" issue next. Sure, I'll see if I can reach for the bucket. -- Simon Riggs www.2ndQuadrant.com
Josh Berkus <josh@agliodbs.com> writes: > So I'm in favor of committing part of the HS code even if there are > known failure conditions, as long as those conditions are well-defined. If we're thinking of committing something that is known broken, I would want to have a clearly defined and trust-inspiring escape strategy. "We can always revert the patch later" inspires absolutely zero confidence here, because in a patch this large there are always going to be overlaps with other later patches. If it gets to be February and HS is still unshippable, reverting is going to be a tricky and risky affair. I agree with Heikki that it would be better not to commit as long as any clear showstoppers remain unresolved. regards, tom lane
On Nov 15, 2009, at 4:19 PM, Simon Riggs <simon@2ndQuadrant.com> wrote: > On Sun, 2009-11-15 at 23:14 +0200, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> On Sun, 2009-11-15 at 22:45 +0200, Heikki Linnakangas wrote: >>>> Simon Riggs wrote: >>>>> Right now, I don't know which you >>>>> consider to be the must-fix issues, hence the thread. >>>> Ok, could you tackle the b-tree vacuum bug, where we neglect to >>>> pin the >>>> index pages after the last b-tree vacuum record? Thanks. >>> >>> That's all? You sure? >> >> For starters. If you think you'll get that done quickly, please >> take a >> look at the "bucket of ice-water" issue next. > > Sure, I'll see if I can reach for the bucket. Me and my big fat mouth... ...Robert
On Nov 15, 2009, at 2:17 PM, Tom Lane wrote: >> So I'm in favor of committing part of the HS code even if there are >> known failure conditions, as long as those conditions are well-defined. > > If we're thinking of committing something that is known broken, I would > want to have a clearly defined and trust-inspiring escape strategy. > "We can always revert the patch later" inspires absolutely zero > confidence here, because in a patch this large there are always going to > be overlaps with other later patches. If it gets to be February and HS > is still unshippable, reverting is going to be a tricky and risky > affair. > > I agree with Heikki that it would be better not to commit as long as > any clear showstoppers remain unresolved. If ever there were an argument for topic branches, *this is it*. Best, David
"David E. Wheeler" <david@kineticode.com> writes: > On Nov 15, 2009, at 2:17 PM, Tom Lane wrote: >> I agree with Heikki that it would be better not to commit as long as >> any clear showstoppers remain unresolved. > If ever there were an argument for topic branches, *this is it*. How so? They've got a perfectly good topic branch, ie, the external git repository they're already working in. If the branch were within core CVS it would accomplish exactly nothing more as far as easing the eventual merge. regards, tom lane
Just a question: - Does Hot Standby allow to use prepared query (not prepared transaction) in standby? I mean: Parse message from frontend can be accepted by standby? - Can we create temporary tables in standby? -- Tatsuo Ishii SRA OSS, Inc. Japan > After some time thinking about the best way forward for Hot Standby, I > have some observations and proposals. > > First, the project is very large. We have agreed ways to trim the patch, > yet it remains large. Trying to do everything in one lump is almost > always a bad plan, so we need to phase things. > > Second, everybody is keen that HS hits the tree, so we can have alpha > code etc.. There are a few remaining issues that should *not* be rushed. > The only way to remove this dependency is to decouple parts of the > project. > > Third, testing the patch is difficult and continuous change makes it > harder to guarantee everything is working. > > There are two remaining areas of significant thought/effort: > > * Issues relating to handling of prepared transactions > * How fast Hot Standby mode is enabled in the standby > > I propose that we stabilise and eventually commit a version of HS that > circumvents/defers those issues and then address the issues with > separate patches afterwards. This approach will allow us to isolate the > areas of further change so we can have a test blitz to remove silly > mistakes, then follow it with a commit to CVS, and then release as Alpha > to allow further testing. > > Let's look at the two areas of difficulty in more detail > > * Issues relating to handling of prepared transactions > There are some delicate issues surrounding what happens at the end of > recovery if there is a prepared transaction still holding an access > exclusive lock. It is straightforward to say, as an interim measure, > "Hot Standby will not work with max_prepared_transactions > 0". I see > that this has a fiddly, yet fairly clear solution. > > * How fast Hot Standby mode is enabled in the standby > We need to have full snapshot information on the standby before we can > allow connections and queries. There are two basic approaches: i) we > wait until we *know* we have full info or ii) we try to collect data and > inject a correct starting condition. Waiting (i) may take a while, but > is clean and requires only a few lines of code. Injecting the starting > condition (ii) requires boatloads of hectic code and we have been unable > to agree a way forwards. If we did have that code, all it would give us > is a faster/more reliable starting point for connections on the standby. > Until we can make approach (ii) work, we should just rely on the easy > approach (i). In many cases, the starting point is very similar. (In > some cases we can actually make (i) faster because the overhead of data > collection forces us to derive the starting conditions minutes apart.) > > Phasing the commit seems like the only way. > > Please can we agree a way forwards? > > -- > Simon Riggs www.2ndQuadrant.com
Tatsuo Ishii wrote: > Just a question: > > - Does Hot Standby allow to use prepared query (not prepared > transaction) in standby? I mean: Parse message from frontend can be > accepted by standby? Yes. > - Can we create tempory tables in standby? No, because creating a temporary table needs to write to the catalogs. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> Tatsuo Ishii wrote: > > Just a question: > > > > - Does Hot Standby allow to use prepared query (not prepared > > transaction) in standby? I mean: Parse message from frontend can be > > accepted by standby? > > Yes. In my understanding, Parse will acquire locks on the target table. Is this harmless? -- Tatsuo Ishii SRA OSS, Inc. Japan
Tatsuo Ishii wrote: > In my understanding, Parse will aquire locks on the target table. Is > this harmless? That's ok, you can take AccessShareLocks in a standby. All queries lock the target table (in AccessShare mode). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2009-11-16 at 13:23 +0900, Tatsuo Ishii wrote: > Just a question: > > - Does Hot Standby allow to use prepared query (not prepared > transaction) in standby? I mean: Parse message from frontend can be > accepted by standby? Yes, no problem with any of those kind of facilities > - Can we create tempory tables in standby? No, but this is for two reasons * CREATE TEMPORARY TABLE actually writes to catalog tables. It doesn't need to do that, so allowing this would require some medium-heavy lifting of the way temp tables work. A preliminary design was agreed in July 2008. I believe it would be a popular feature, since about 40-50% of people ask for this. * CREATE TEMP TABLE is currently considered to be disallowed during read only transactions. That might be able to change if the underlying physical operation were write-free. -- Simon Riggs www.2ndQuadrant.com
> > - Does Hot Standby allow to use prepared query (not prepared > > transaction) in standby? I mean: Parse message from frontend can be > > accepted by standby? > > Yes, no problem with any of those kind of facilities Please correct me if I'm wrong. Parse will result in obtaining RowExclusiveLock on the target table if it is parsing INSERT/UPDATE/DELETE. If so, is this ok in the standby? > > - Can we create tempory tables in standby? > > No, but this is for two reasons > > * CREATE TEMPORARY TABLE actually writes to catalog tables. It doesn't > need to do that, so allowing this would require some medium-heavy > lifting of the way temp tables work. A preliminary design was agreed in > July 2008. I believe it would be a popular feature, since about 40-50% > of people ask for this. > > * CREATE TEMP TABLE is currently considered to be disallowed during read > only transactions. That might be able to change if the underlying > physical operation were write-free. Thanks for explanation. -- Tatsuo Ishii SRA OSS, Inc. Japan
Tom Lane wrote: > I agree with Heikki that it would be better not to commit as long as > any clear showstoppers remain unresolved. I agree that it would be better not to commit as long as any of the following are true: (1) There are any known issues which would break things for clusters *not using* hot standby. (2) There isn't an easy way to disable configuration of hot standby. (3) There is significant doubt that the vast majority of the patch will be useful in the eventually-enabled final solution. If none of these are true, I'm not sure what the downside of a commit is. -Kevin
On Mon, Nov 16, 2009 at 11:07 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Tom Lane wrote: > >> I agree with Heikki that it would be better not to commit as long as >> any clear showstoppers remain unresolved. > > I agree that it would be better not to commit as long as any of the > following are true: > > (1) There are any known issues which would break things for clusters > *not using* hot standby. > > (2) There isn't an easy way for to disable configuration of hot > standby. > > (3) There is significant doubt that the vast majority of the patch > will be useful in the eventually-enabled final solution. > > If none of these are true, I'm not sure what the down side of a commit > is. Well, I think you wouldn't want to commit something that enabled Hot Standby but caused Hot Standby queries to give wrong answers, or didn't even allow some/all queries to be executed. That's fairly pointless, and might mislead users into thinking we had a feature when we really didn't. ...Robert
On Mon, 2009-11-16 at 19:06 +0900, Tatsuo Ishii wrote: > > > - Does Hot Standby allow to use prepared query (not prepared > > > transaction) in standby? I mean: Parse message from frontend can be > > > accepted by standby? > > > > Yes, no problem with any of those kind of facilities > > Please correct me if I'm wrong. Parse will result in obtaining > RowExclusiveLock on the target table if it is parsing > INSERT/UPDATE/DELETE. If so, is this ok in the standby? Any attempt to take RowExclusiveLock will fail. Any attempt to execute INSERT/UPDATE/DELETE will fail. This behaviour should be identical to read only transaction mode. If it is not documented as an exception, please report as a bug. -- Simon Riggs www.2ndQuadrant.com
> > Please correct me if I'm wrong. Parse will result in obtaining > > RowExclusiveLock on the target table if it is parsing > > INSERT/UPDATE/DELETE. If so, is this ok in the standby? > > Any attempt to take RowExclusiveLock will fail. > > Any attempt to execute INSERT/UPDATE/DELETE will fail. > > This behaviour should be identical to read only transaction mode. If it > is not documented as an exception, please report as a bug. Is it? It seems read only transaction mode is perfectly happy with RowExclusiveLock:

test=# begin;
BEGIN
test=# set transaction read only;
SET
test=# prepare a(int) as insert into t1 values($1);
PREPARE
test=# \x
Expanded display is on.
test=# select * from pg_locks;
-[ RECORD 1 ]------+-----------------
locktype           | relation
database           | 1297143
relation           | 10969
page               |
tuple              |
virtualxid         |
transactionid      |
classid            |
objid              |
objsubid           |
virtualtransaction | 1/101699
pid                | 28020
mode               | AccessShareLock
granted            | t
-[ RECORD 2 ]------+-----------------
locktype           | virtualxid
database           |
relation           |
page               |
tuple              |
virtualxid         | 1/101699
transactionid      |
classid            |
objid              |
objsubid           |
virtualtransaction | 1/101699
pid                | 28020
mode               | ExclusiveLock
granted            | t
-[ RECORD 3 ]------+-----------------
locktype           | relation
database           | 1297143
relation           | 1574918
page               |
tuple              |
virtualxid         |
transactionid      |
classid            |
objid              |
objsubid           |
virtualtransaction | 1/101699
pid                | 28020
mode               | RowExclusiveLock
granted            | t

test=# select relname from pg_class where oid = 1574918;
-[ RECORD 1 ]
relname | t1

-- Tatsuo Ishii SRA OSS, Inc. Japan
Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Nov 16, 2009 at 11:07 AM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Tom Lane wrote: >> >>> I agree with Heikki that it would be better not to commit as long >>> as any clear showstoppers remain unresolved. >> >> I agree that it would be better not to commit as long as any of the >> following are true: >> >> (1) There are any known issues which would break things for >> clusters *not using* hot standby. >> >> (2) There isn't an easy way for to disable configuration of hot >> standby. >> >> (3) There is significant doubt that the vast majority of the patch >> will be useful in the eventually-enabled final solution. >> >> If none of these are true, I'm not sure what the down side of a >> commit is. > > Well, I think you wouldn't want to commit something that enabled Hot > Standby but caused Hot Standby queries to give wrong answers, or > didn't even allow some/all queries to be executed. That's fairly > pointless, and might mislead users into thinking we had a feature > when we really didn't. I might. Based on my project management experience and the tone of the posts on this feature, I suspect that there would be benefit to committing the code -- even if the ability to enable it was commented out. For starters, I suspect that most people won't be using it, so the most important thing for most users is that the patch breaks nothing when the feature is not configured. Also, if we have high confidence that the vast majority of this code will eventually be committed, and likely in 8.5, the sooner it is what people work against, the less likely that a late change will destabilize something. Having made that point, I'm happy to leave it to Heikki's judgment; it's definitely not something I care enough about to argue at any length.... -Kevin
Tatsuo Ishii wrote: >>> Please correct me if I'm wrong. Parse will result in obtaining >>> RowExclusiveLock on the target table if it is parsing >>> INSERT/UPDATE/DELETE. If so, is this ok in the standby? >> Any attempt to take RowExclusiveLock will fail. >> >> Any attempt to execute INSERT/UPDATE/DELETE will fail. >> >> This behaviour should be identical to read only transaction mode. If it >> is not documented as an exception, please report as a bug. > > Is it? > > It seems read only transaction mode is perfectly happy with > RowExclusiveLock: Hmm, that's a good point. I can't immediately see that that would cause any trouble, but it gives me an uncomfortable feeling about the locking. Which locks exactly need to be replayed in standby, and why? Which locks can read-only transactions acquire? The doc says: + In recovery, transactions will not be permitted to take any table lock + higher than AccessShareLock. In addition, transactions may never assign + a TransactionId and may never write WAL. + Any LOCK TABLE command that runs on the standby and requests a specific + lock type other than AccessShareLock will be rejected. which seems wrong, given Tatsuo-sans example. Is that paragraph only referring to LOCK TABLE, and not other means of acquiring locks? Either way, it needs to be clarified or fixed. access/transam/README says: +Further details on locking mechanics in recovery are given in comments +with the Lock rmgr code. but there's no explanation there either *why* the locking works as it is. In LockAcquire(), we forbid taking locks higher than AccessShareLock during recovery mode, but only for LOCKTAG_OBJECT locks. Why? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
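(For illustration of the documented restriction quoted above: a hypothetical standby session, with a made-up table name and paraphrased comments rather than exact error text.)

  BEGIN;
  LOCK TABLE accounts IN ACCESS SHARE MODE;  -- allowed during recovery
  LOCK TABLE accounts IN SHARE MODE;         -- rejected: only AccessShareLock is permitted, per the paragraph above
  ROLLBACK;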
On Wed, 2009-11-18 at 14:51 +0200, Heikki Linnakangas wrote: > Tatsuo Ishii wrote: > >>> Please correct me if I'm wrong. Parse will result in obtaining > >>> RowExclusiveLock on the target table if it is parsing > >>> INSERT/UPDATE/DELETE. If so, is this ok in the standby? > >> Any attempt to take RowExclusiveLock will fail. > >> > >> Any attempt to execute INSERT/UPDATE/DELETE will fail. > >> > >> This behaviour should be identical to read only transaction mode. If it > >> is not documented as an exception, please report as a bug. > > > > Is it? > > > > It seems read only transaction mode is perfectly happy with > > RowExclusiveLock: > > Hmm, that's a good point. I can't immediately see that that would cause > any trouble, but it gives me an uncomfortable feeling about the locking. > Which locks exactly need to be replayed in standby, and why? Which locks > can read-only transactions acquire? > > The doc says: > + In recovery, transactions will not be permitted to take any table lock > + higher than AccessShareLock. In addition, transactions may never assign > + a TransactionId and may never write WAL. > + Any LOCK TABLE command that runs on the standby and requests a specific > + lock type other than AccessShareLock will be rejected. > > which seems wrong, given Tatsuo-sans example. Is that paragraph only > referring to LOCK TABLE, and not other means of acquiring locks? Either > way, it needs to be clarified or fixed. > > access/transam/README says: > +Further details on locking mechanics in recovery are given in comments > +with the Lock rmgr code. > > but there's no explanation there either *why* the locking works as it > is. In LockAcquire(), we forbid taking locks higher than AccessShareLock > during recovery mode, but only for LOCKTAG_OBJECT locks. Why? Recovery does *not* take the same locks as the original statements on the master took. For example, the WAL record for an INSERT just makes its changes without acquiring locks. This is OK as long as we only allow read-only users to acquire AccessShareLocks. If we allowed higher locks we might need to do deadlock detection, which would add more complexity. The above restrictions are limited to LOCKTAG_OBJECT so that advisory locks work as advertised. So advisory locks can take both shared and exclusive locks. This never conflicts with recovery because advisory locks are not WAL logged. -- Simon Riggs www.2ndQuadrant.com
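(For illustration of the advisory-lock point above: a short sketch run on the standby. The key value is arbitrary.)

  SELECT pg_advisory_lock(42);    -- exclusive advisory lock, allowed during recovery per the explanation above
  SELECT pg_advisory_unlock(42);  -- released; the master never sees this lock, since advisory locks are not WAL-logged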
Simon Riggs wrote: > Recovery does *not* take the same locks as the original statements on > the master took. For example, the WAL record for an INSERT just makes > its changes without acquiring locks. This is OK as long as we only allow > read-only users to acquire AccessShareLocks. If we allowed higher locks > we might need to do deadlock detection, which would add more complexity. But we *do* allow higher locks than AccessShareLocks, as Tatsuo-sans example shows. Is that a bug? > The above restrictions are limited to LOCKTAG_OBJECT so that advisory > locks work as advertised. So advisory locks can take both shared and > exclusive locks. This never conflicts with recovery because advisory > locks are not WAL logged. So we allow any lock on anything *except* LOCKTAG_OBJECT. That includes advisory locks, but also relation locks, tuple locks and page locks. Looking at the lock types in detail: LOCKTAG_RELATION Any lock level is allowed. We have other defenses against actually modifying a relation, but it feels a bit fragile and I got the impression from your comments that it's not intentional. LOCKTAG_RELATION_EXTEND Any lock level is allowed. Again, we have other defenses against modifying relations, but feels fragile. LOCKTAG_PAGE Any lock level is allowed. Page locks are only used when extending a hash index, so it seems irrelevant what we do. I think we should disallow page locks in standby altogether. LOCKTAG_TUPLE, Any lock level is allowed. Only used when locking a tuple for update. We forbid locking tuples by the general "is the transaction read-only?" check in executor, and if you manage to bypass that, you will fail to get an XID to set to xmax. Nevertheless, seems we shouldn't allow tuple locks. LOCKTAG_TRANSACTION, Any lock level is allowed. Acquired in AssignTransactionId, to allow others to wait for the transaction to finish. We don't allow AssignTransactionId() during recovery, but could someone want to wait for a transaction to finish? All the current callers of XactLockTableWait() seem to be from operations that are not allowed in recovery. Should we take a conservative stance and disallow taking transaction-locks? LOCKTAG_VIRTUALTRANSACTION Any lock level is allowed. Similar to transaction locks, but virtual transaction locks are held by read-only transactions as well. Also during recovery, and we rely on it in the code to wait for a conflicting transaction to finish. But we don't acquire locks to represent transactions in master. LOCKTAG_OBJECT, Anything higher than AccessShareLock is disallowed. Used by dependency walking in pg_depend.c. Also used as interlock between database start and DROP/CREATE DATABASE. At backend start, we normally take RowExclusiveLock on the database in postinit.c, but you had to modify to acquire AccessShareLock instead in standby mode. LOCKTAG_USERLOCK LOCKTAG_ADVISORY Any lock level is allowed. As documented, advisory locks are per-server, so a lock taken in master doesn't conflict with one taken in slave. In any case, all this really needs to be documented in a README or something. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> Simon Riggs wrote: > > Recovery does *not* take the same locks as the original statements on > > the master took. For example, the WAL record for an INSERT just makes > > its changes without acquiring locks. This is OK as long as we only allow > > read-only users to acquire AccessShareLocks. If we allowed higher locks > > we might need to do deadlock detection, which would add more complexity. > > But we *do* allow higher locks than AccessShareLocks, as Tatsuo-sans > example shows. Is that a bug? Sorry for confusion. My example is under normal PostgreSQL, not under HS enabled. -- Tatsuo Ishii SRA OSS, Inc. Japan > > The above restrictions are limited to LOCKTAG_OBJECT so that advisory > > locks work as advertised. So advisory locks can take both shared and > > exclusive locks. This never conflicts with recovery because advisory > > locks are not WAL logged. > > So we allow any lock on anything *except* LOCKTAG_OBJECT. That includes > advisory locks, but also relation locks, tuple locks and page locks. > > Looking at the lock types in detail: > > LOCKTAG_RELATION > > Any lock level is allowed. We have other defenses against actually > modifying a relation, but it feels a bit fragile and I got the > impression from your comments that it's not intentional. > > LOCKTAG_RELATION_EXTEND > > Any lock level is allowed. Again, we have other defenses against > modifying relations, but feels fragile. > > LOCKTAG_PAGE > > Any lock level is allowed. Page locks are only used when extending a > hash index, so it seems irrelevant what we do. I think we should > disallow page locks in standby altogether. > > LOCKTAG_TUPLE, > > Any lock level is allowed. Only used when locking a tuple for update. We > forbid locking tuples by the general "is the transaction read-only?" > check in executor, and if you manage to bypass that, you will fail to > get an XID to set to xmax. Nevertheless, seems we shouldn't allow tuple > locks. > > LOCKTAG_TRANSACTION, > > Any lock level is allowed. Acquired in AssignTransactionId, to allow > others to wait for the transaction to finish. We don't allow > AssignTransactionId() during recovery, but could someone want to wait > for a transaction to finish? All the current callers of > XactLockTableWait() seem to be from operations that are not allowed in > recovery. Should we take a conservative stance and disallow taking > transaction-locks? > > LOCKTAG_VIRTUALTRANSACTION > > Any lock level is allowed. Similar to transaction locks, but virtual > transaction locks are held by read-only transactions as well. Also > during recovery, and we rely on it in the code to wait for a conflicting > transaction to finish. But we don't acquire locks to represent > transactions in master. > > LOCKTAG_OBJECT, > > Anything higher than AccessShareLock is disallowed. Used by dependency > walking in pg_depend.c. Also used as interlock between database start > and DROP/CREATE DATABASE. At backend start, we normally take > RowExclusiveLock on the database in postinit.c, but you had to modify to > acquire AccessShareLock instead in standby mode. > > LOCKTAG_USERLOCK > LOCKTAG_ADVISORY > > Any lock level is allowed. As documented, advisory locks are per-server, > so a lock taken in master doesn't conflict with one taken in slave. > > > In any case, all this really needs to be documented in a README or > something. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com
Tatsuo Ishii wrote: > Sorry for confusion. My example is under normal PostgreSQL, not under > HS enabled. You get the same result in standby:

postgres=# begin;
BEGIN
postgres=# prepare a(int) as insert into foo values($1);
PREPARE
postgres=# SELECT * FROM pg_locks;
  locktype  │ database │ relation │ page │ tuple │ virtualxid │ transactionid │ classid │ objid │ objsubid │ virtualtransaction │  pid  │       mode       │ granted
────────────┼──────────┼──────────┼──────┼───────┼────────────┼───────────────┼─────────┼───────┼──────────┼────────────────────┼───────┼──────────────────┼─────────
 relation   │    11564 │    10968 │      │       │            │               │         │       │          │ 2/4                │ 10449 │ AccessShareLock  │ t
 relation   │    11564 │    16384 │      │       │            │               │         │       │          │ 2/4                │ 10449 │ RowExclusiveLock │ t
 virtualxid │          │          │      │       │ 1/1        │               │         │       │          │ 1/0                │ 10419 │ ExclusiveLock    │ t
 virtualxid │          │          │      │       │ 2/4        │               │         │       │          │ 2/4                │ 10449 │ ExclusiveLock    │ t
(4 rows)

This is from a standby.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tatsuo Ishii wrote: >> Sorry for confusion. My example is under normal PostgreSQL, not under >> HS enabled. > You get the same result in standby: AFAICT Tatsuo's example just shows that we might wish to add a check for read-only transaction mode before parsing an INSERT/UPDATE/DELETE command. But it seems relatively minor in any case --- at the worst you'd get an unexpected error message, no? regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Tatsuo Ishii wrote: >>> Sorry for confusion. My example is under normal PostgreSQL, not under >>> HS enabled. > >> You get the same result in standby: > > AFAICT Tatsuo's example just shows that we might wish to add a check > for read-only transaction mode before parsing an INSERT/UPDATE/DELETE > command. But it seems relatively minor in any case --- at the worst > you'd get an unexpected error message, no? Right, it's harmless AFAICS. And it might actually be useful to be able to prepare all queries right after connecting, even though the connection is not yet read-write. It's the documentation (in source code or README) that's lacking, and perhaps we should add more explicit checks for the "can't happen" cases, just in case. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 11/15/09 11:07 PM, Heikki Linnakangas wrote: > - When replaying b-tree deletions, we currently wait out/cancel all > running (read-only) transactions. We take the ultra-conservative stance > because we don't know how recent the tuples being deleted are. If we > could store a better estimate for latestRemovedXid in the WAL record, we > could make that less conservative. Simon was explaining this issue here at JPUGCon; now that I understand it, this specific issue seems like the worst usability issue in HS now. Bad enough to kill its usefulness for users, or even our ability to get useful testing data; in an OLTP production database with several hundred inserts per second it would result in pretty much never being able to get any query which takes longer than a few seconds to complete on the slave. --Josh Berkus
On Fri, 2009-11-20 at 11:14 +0900, Josh Berkus wrote: > On 11/15/09 11:07 PM, Heikki Linnakangas wrote: > > - When replaying b-tree deletions, we currently wait out/cancel all > > running (read-only) transactions. We take the ultra-conservative stance > > because we don't know how recent the tuples being deleted are. If we > > could store a better estimate for latestRemovedXid in the WAL record, we > > could make that less conservative. > > Simon was explaining this issue here at JPUGCon; now that I understand > it, this specific issue seems like the worst usability issue in HS now. > Bad enough to kill its usefulness for users, or even our ability to get > useful testing data; in an OLTP production database with several hundred > inserts per second it would result in pretty much never being able to > get any query which takes longer than a few seconds to complete on the > slave. I am pretty sure that OmniTI, PgExperts, EDB and CMD all have customers that are doing more than that... This sounds pretty significant. Joshua D. Drake > > --Josh Berkus > > > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
Joshua D. Drake wrote: > On Fri, 2009-11-20 at 11:14 +0900, Josh Berkus wrote: > >> On 11/15/09 11:07 PM, Heikki Linnakangas wrote: >> >>> - When replaying b-tree deletions, we currently wait out/cancel all >>> running (read-only) transactions. We take the ultra-conservative stance >>> because we don't know how recent the tuples being deleted are. If we >>> could store a better estimate for latestRemovedXid in the WAL record, we >>> could make that less conservative. >>> >> Simon was explaining this issue here at JPUGCon; now that I understand >> it, this specific issue seems like the worst usability issue in HS now. >> Bad enough to kill its usefulness for users, or even our ability to get >> useful testing data; in an OLTP production database with several hundred >> inserts per second it would result in pretty much never being able to >> get any query which takes longer than a few seconds to complete on the >> slave. >> > > I am pretty sure that OmniTI, PgExperts, EDB and CMD all have customers > that are doing more than that... This sounds pretty significant. > > Right. The major use I was hoping for from HS was exactly to be able to run long-running queries. In one case I am thinking of, we have moved the business intelligence uses off the OLTP server onto a londiste replica, and I was really wanting to move that to a Hot Standby server. cheers andrew
On Thu, 2009-11-19 at 10:13 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > Recovery does *not* take the same locks as the original statements on > > the master took. For example, the WAL record for an INSERT just makes > > its changes without acquiring locks. This is OK as long as we only allow > > read-only users to acquire AccessShareLocks. If we allowed higher locks > > we might need to do deadlock detection, which would add more complexity. > > But we *do* allow higher locks than AccessShareLocks, as Tatsuo-sans > example shows. Is that a bug? > > > The above restrictions are limited to LOCKTAG_OBJECT so that advisory > > locks work as advertised. So advisory locks can take both shared and > > exclusive locks. This never conflicts with recovery because advisory > > locks are not WAL logged. > > So we allow any lock on anything *except* LOCKTAG_OBJECT. That includes > advisory locks, but also relation locks, tuple locks and page locks. > > Looking at the lock types in detail: > > LOCKTAG_RELATION > > Any lock level is allowed. We have other defenses against actually > modifying a relation, but it feels a bit fragile and I got the > impression from your comments that it's not intentional. Possibly fragile, will look further. LOCKTAG_OBJECT was the important one in testing. > LOCKTAG_RELATION_EXTEND > > Any lock level is allowed. Again, we have other defenses against > modifying relations, but feels fragile. This only ever happens after an xid is assigned, which can never happen during recovery. Happy to add protection if you think so. > LOCKTAG_PAGE > > Any lock level is allowed. Page locks are only used when extending a > hash index, so it seems irrelevant what we do. I think we should > disallow page locks in standby altogether. As above, but OK. > LOCKTAG_TUPLE, > > Any lock level is allowed. Only used when locking a tuple for update. We > forbid locking tuples by the general "is the transaction read-only?" > check in executor, and if you manage to bypass that, you will fail to > get an XID to set to xmax. Nevertheless, seems we shouldn't allow tuple > locks. Specifically disallowed earlier, when queries request row marks. > LOCKTAG_TRANSACTION, > > Any lock level is allowed. Acquired in AssignTransactionId, to allow > others to wait for the transaction to finish. We don't allow > AssignTransactionId() during recovery, but could someone want to wait > for a transaction to finish? All the current callers of > XactLockTableWait() seem to be from operations that are not allowed in > recovery. Should we take a conservative stance and disallow taking > transaction-locks? Only used after xid assignment, which is disallowed during recovery. > LOCKTAG_VIRTUALTRANSACTION > > Any lock level is allowed. Similar to transaction locks, but virtual > transaction locks are held by read-only transactions as well. Also > during recovery, and we rely on it in the code to wait for a conflicting > transaction to finish. But we don't acquire locks to represent > transactions in master. Only ever requested as exclusive. > LOCKTAG_OBJECT, > > Anything higher than AccessShareLock is disallowed. Used by dependency > walking in pg_depend.c. Also used as interlock between database start > and DROP/CREATE DATABASE. At backend start, we normally take > RowExclusiveLock on the database in postinit.c, but you had to modify to > acquire AccessShareLock instead in standby mode. Yes > LOCKTAG_USERLOCK > LOCKTAG_ADVISORY > > Any lock level is allowed. As documented, advisory locks are per-server, > so a lock taken in master doesn't conflict with one taken in slave. Yes -- Simon Riggs www.2ndQuadrant.com
On Thu, 2009-11-19 at 17:15 +0900, Tatsuo Ishii wrote: > > Simon Riggs wrote: > > > Recovery does *not* take the same locks as the original statements on > > > the master took. For example, the WAL record for an INSERT just makes > > > its changes without acquiring locks. This is OK as long as we only allow > > > read-only users to acquire AccessShareLocks. If we allowed higher locks > > > we might need to do deadlock detection, which would add more complexity. > > > > But we *do* allow higher locks than AccessShareLocks, as Tatsuo-sans > > example shows. Is that a bug? > > Sorry for confusion. My example is under normal PostgreSQL, not under > HS enabled. Are you saying you want it to work in HS mode? Why would you want to PREPARE an INSERT, but never execute it? -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Thu, 2009-11-19 at 17:15 +0900, Tatsuo Ishii wrote: >>> Simon Riggs wrote: >>>> Recovery does *not* take the same locks as the original statements on >>>> the master took. For example, the WAL record for an INSERT just makes >>>> its changes without acquiring locks. This is OK as long as we only allow >>>> read-only users to acquire AccessShareLocks. If we allowed higher locks >>>> we might need to do deadlock detection, which would add more complexity. >>> But we *do* allow higher locks than AccessShareLocks, as Tatsuo-sans >>> example shows. Is that a bug? >> Sorry for confusion. My example is under normal PostgreSQL, not under >> HS enabled. > > Are you saying you want it to work in HS mode? > > Why would you want to PREPARE an INSERT, but never execute it? Well, I can easily imagine an application that keeps persistent connections and prepares all the queries it might execute right after making the initial connection, while still being aware of the master/slave setup. So the scenario would be "prepare but never execute as long as we are in recovery - but once we leave recovery we can use them". Stefan
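(For illustration: a sketch of the workflow Stefan describes, with hypothetical statement and table names. The PREPARE is accepted while the server is still in recovery; executing it only becomes possible once the standby has been promoted.)

  -- at connect time, on a server that may still be a standby:
  PREPARE new_order (int, numeric) AS
      INSERT INTO orders (customer_id, amount) VALUES ($1, $2);
  -- while in recovery, EXECUTE new_order(...) fails with the usual
  -- read-only error; after promotion the same session can simply run:
  EXECUTE new_order(42, 19.99);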
Joshua D. Drake wrote: > On Fri, 2009-11-20 at 11:14 +0900, Josh Berkus wrote: >> On 11/15/09 11:07 PM, Heikki Linnakangas wrote: >>> - When replaying b-tree deletions, we currently wait out/cancel all >>> running (read-only) transactions. We take the ultra-conservative stance >>> because we don't know how recent the tuples being deleted are. If we >>> could store a better estimate for latestRemovedXid in the WAL record, we >>> could make that less conservative. >> Simon was explaining this issue here at JPUGCon; now that I understand >> it, this specific issue seems like the worst usability issue in HS now. >> Bad enough to kill its usefulness for users, or even our ability to get >> useful testing data; in an OLTP production database with several hundred >> inserts per second it would result in pretty much never being able to >> get any query which takes longer than a few seconds to complete on the >> slave. > > I am pretty sure that OmniTI, PgExperts, EDB and CMD all have customers > that are doing more than that... This sounds pretty significant. Agreed, it's the biggest usability issue at the moment. The max_standby_delay option makes it less annoying, but it's still there. I'm fine with it from a code point of view, so I'm not going to hold off committing because of it, but it sure would be nice to address it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Nov 20, 2009 at 11:14 AM, Josh Berkus <josh@agliodbs.com> wrote: > On 11/15/09 11:07 PM, Heikki Linnakangas wrote: >> - When replaying b-tree deletions, we currently wait out/cancel all >> running (read-only) transactions. We take the ultra-conservative stance >> because we don't know how recent the tuples being deleted are. If we >> could store a better estimate for latestRemovedXid in the WAL record, we >> could make that less conservative. > > Simon was explaining this issue here at JPUGCon; now that I understand > it, this specific issue seems like the worst usability issue in HS now. > Bad enough to kill its usefulness for users, or even our ability to get > useful testing data; in an OLTP production database with several hundred > inserts per second it would result in pretty much never being able to > get any query which takes longer than a few seconds to complete on the > slave. I don't think that's all that was discussed :) Are you saying that it should not be committed if this issue still exists? The point of getting Hot Standby into core is to provide useful functionality. We can make it clear to people what the limitations are, and Simon has said that he will continue to work on solving this problem. -selena -- http://chesnok.com/daily - me http://endpoint.com - work
On Fri, Nov 20, 2009 at 2:58 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > Right. The major use I was hoping for from HS was exactly to be able to run > long-running queries. In once case I am thinking of we have moved the > business intelligence uses off the OLTP server onto a londiste replica, and > I was really wanting to move that to a Hot Standby server. I think Simon's focus on the High Availability use case has obscured the fact that there are two entirely complementary (and conflicting) use cases here. If your primary reason for implementing Hot Standby is to be able to run long-running batch queries then you will probably want to set a very high max_standby_delay or even disable it entirely. If you set max_standby_delay to 0 then the recovery will wait indefinitely for your batch queries to finish. You would probably need to schedule quiet periods in order to ensure that the recovery can catch up periodically. If you also need high availability you would need your HA replicas to run with a low max_standby_delay setting as well. This doesn't mean that the index btree split problem isn't a problem though. It's just trading one problem for another. Instead of having all your queries summarily killed regularly you would find recovery pausing extremely frequently for a very long time, rather than just when vacuum runs and for a limited time. I missed the original discussion of this problem; do you happen to remember the subject or URL for the details? -- greg
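(For illustration: a hypothetical postgresql.conf sketch of the two profiles Greg describes, using the semantics stated in this message, where 0 means "wait indefinitely". Parameter names, units and meanings could still change before the patch is final.)

  # reporting / batch standby: let long queries win over recovery
  max_standby_delay = 0      # recovery waits indefinitely for queries, per the semantics above

  # high-availability standby: keep replay lag small, cancel conflicting queries quickly
  max_standby_delay = 30s    # example value only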
On Fri, 2009-11-20 at 11:14 +0900, Josh Berkus wrote: > On 11/15/09 11:07 PM, Heikki Linnakangas wrote: > > - When replaying b-tree deletions, we currently wait out/cancel all > > running (read-only) transactions. We take the ultra-conservative stance > > because we don't know how recent the tuples being deleted are. If we > > could store a better estimate for latestRemovedXid in the WAL record, we > > could make that less conservative. > > Simon was explaining this issue here at JPUGCon; now that I understand > it, this specific issue seems like the worst usability issue in HS now. > Bad enough to kill its usefulness for users, or even our ability to get > useful testing data; in an OLTP production database with several hundred > inserts per second it would result in pretty much never being able to > get any query which takes longer than a few seconds to complete on the > slave. <sigh> This post isn't really very helpful. You aren't providing the second part of the discussion, nor even requesting that this issue be fixed. I can see such comments being taken up by people with a clear interest in dissing HS. The case of several hundred inserts per second would not generate any cleanup records at all. So it's not completely accurate, nor is it acceptable to generalise. There is nothing about the HS architecture that will prevent it from being used by high-traffic sites, or for long standby queries. The specific action that will cause problems is a workload that generates high-volume inserts and deletes. A solution is possible. Heikki and I had mentioned that solving this need not be part of the initial patch, since it wouldn't affect all users. I specifically removed my solution in July/Aug, to allow the patch to be slimmed down. In any case, the problem does have a simple workaround that is documented as part of the current patch. Conflict resolution is explained in detail with the patch. From my side, the purpose of discussing this was to highlight something which is not technically a bug, yet clearly still needs work before close. And it also needs to be on the table, to allow further discussion and generate the impetus to allow work on it in this release. -- Simon Riggs www.2ndQuadrant.com
On Fri, 2009-11-20 at 06:47 +0000, Greg Stark wrote: > On Fri, Nov 20, 2009 at 2:58 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > Right. The major use I was hoping for from HS was exactly to be able to run > > long-running queries. In once case I am thinking of we have moved the > > business intelligence uses off the OLTP server onto a londiste replica, and > > I was really wanting to move that to a Hot Standby server. > > I think Simon's focus on the High Availability use case has obscured > the fact that there are two entirely complementary (and conflicting) > use cases here. If your primary reason for implementing Hot Standby is > to be able to run long-running batch queries then will probably want > to set a very high max_standby_delay or even disable it entirely. If > you set max_standby_delay to 0 then the recovery will wait > indefinitely for your batch queries to finish. You would probably need > to schedule quiet periods in order to ensure that the recovery can > catch up periodically. If you also need high availability you would > need your HA replicas to run with a low max_standby_delay setting as > well. If I read this correctly then I have provided the facilities you would like. Can you confirm you have everything you want, or can you suggest what extra feature is required? > This doesn't mean that the index btree split problem isn't a problem > though. It's just trading one problem for another. Instead of having > all your queries summarily killed regularly you would find recovery > pausing extremely frequently for a very long time, rather than just > when vacuum runs and for a limited time. > > I missed the original discussion of this problem, do you happen to > remember the subject or url for the details? December 2008; hackers; you, me and Heikki. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Fri, 2009-11-20 at 06:47 +0000, Greg Stark wrote: >> I missed the original discussion of this problem, do you happen to >> remember the subject or url for the details? > > December 2008; hackers; you, me and Heikki. Yep: http://archives.postgresql.org/message-id/494B5FFE.4090909@enterprisedb.com -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon, > <sigh> This post isn't really very helpful. You aren't providing the > second part of the discussion, nor even requesting that this issue be > fixed. I can see such comments being taken up by people with a clear > interest in dissing HS. OK, I'm requesting that the issue be fixed. I'm not sure if it needs to be fixed before Alpha3. I got the impression from you that others did not think this issue really needed fixing. > Heikki and I had mentioned that solving this need not be part of the > initial patch, since it wouldn't effect all users. I specifically > removed my solution in July/Aug, to allow the patch to be slimmed down. I guess I don't understand which users wouldn't be affected. It seems like the only users who would avoid this are ones who don't do deletes or index-affecting updates. > From my side, the purpose of discussing this was to highlight something > which is not technically a bug, yet clearly still needs work before > close. And it also needs to be on the table, to allow further discussion > and generate the impetus to allow work on it in this release. Yes. I'm realizing that because of the highly technical nature of the discussion and the language used, few people other than you and Heikki are aware of the major issues which still need work. It would be helpful if someone could post a summary of outstanding issues which didn't require prior extensive experience with the HS code to understand; possibly you could then get people trying to tackle just those individual issues. --Josh Berkus
On Fri, Nov 20, 2009 at 7:31 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Simon Riggs wrote: >> On Fri, 2009-11-20 at 06:47 +0000, Greg Stark wrote: >>> I missed the original discussion of this problem, do you happen to >>> remember the subject or url for the details? >> >> December 2008; hackers; you, me and Heikki. > > Yep: > http://archives.postgresql.org/message-id/494B5FFE.4090909@enterprisedb.com And I can see I failed to understand the issue at the time. From the list it looks like the last word was Simon's: http://archives.postgresql.org/message-id/1229710177.4793.567.camel@ebony.2ndQuadrant From discussions in the bar it sounds like this was actually a false start however as the RecentGlobalXmin in the backend doing the split could be less aggressive than the RecentGlobalXmin used by some other backend to hit the hint bits leading to inconsistent results :( I'm leaning towards having the backend actually go fetch all the xmin/xmaxes of the pointers being pruned. It ought to be possible to skip that check in any database with no live snapshots so recovery performance would be unaffected on replicas not actively being used in hot mode. -- greg
On Sun, 2009-11-15 at 17:17 -0500, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > So I'm in favor of committing part of the HS code even if there are > > known failure conditions, as long as those conditions are well-defined. > > If we're thinking of committing something that is known broken, I would > want to have a clearly defined and trust-inspiring escape strategy. If it is broken, we shouldn't commit it at all. Commit it to some "other" git branch and call it postgresql-alpha3-riggs-heikki if you must, but keep it out of core. > > I agree with Heikki that it would be better not to commit as long as > any clear showstoppers remain unresolved. > Agreed. Joshua D. Drake > regards, tom lane > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
On Fri, 2009-11-20 at 16:40 +0900, Josh Berkus wrote: > Yes. I'm realizing that because of the highly techincal nature of the > discussion and the language used few people other than you and Heikki > are aware of the major issues which still need work. It would be > helpful if someone could post a summary of outstanding issues which > didn't require prior extensive experience with the HS code to > understand; possibly you could then get people trying to tackle just > those individual issues. > Yes I believe it is time for that. Those of us neck deep in production loads would feel a lot better if we knew from a real world perspective what the issue is. Sincerely, Joshua D. Drake > --Josh BErkus > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
Greg Stark wrote: > From discussions in the bar it sounds like this was actually a false > start however as the RecentGlobalXmin in the backend doing the split > could be less aggressive than the RecentGlobalXmin used by some other > backend to hit the hint bits leading to inconsistent results :( Yeah, RecentGlobalXmin was wrong, it's not used at the moment. > I'm leaning towards having the backend actually go fetch all the > xmin/xmaxes of the pointers being pruned. It ought to be possible to > skip that check in any database with no live snapshots so recovery > performance would be unaffected on replicas not actively being used in > hot mode. Hmm, I have always been thinking that it would be detrimental to performance to go fetch the xmin/xmaxes, but maybe it indeed wouldn't be so bad if you could optimize the common case where there's no snapshots in the standby. Also, if you have a very busy table where a lot of tuples are killed, many of the heap pages will probably be in cache. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-11-19 at 10:13 +0200, Heikki Linnakangas wrote: > At backend start, we normally take > RowExclusiveLock on the database in postinit.c, but you had to modify > to acquire AccessShareLock instead in standby mode. The consensus from earlier discussion was that allowing users to grab RowExclusiveLock during read only transactions was not a problem, since it allowed PREPARE. So there seems no need to prevent it in other places. So I suggest removing most of the changes in postinit.c, and changing the lock restrictions in lock.c to be

+    if (RecoveryInProgress() &&
+        (locktag->locktag_type == LOCKTAG_OBJECT ||
+         locktag->locktag_type == LOCKTAG_RELATION ) &&
+        lockmode > RowExclusiveLock)
then ERROR

lockcmds.c would also be changed to allow LOCK TABLE of up to RowExclusiveLock. -- Simon Riggs www.2ndQuadrant.com
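(For illustration: under this proposal a standby session would behave roughly as follows. This is a sketch only, with a made-up table name and paraphrased comments rather than exact error text.)

  BEGIN;
  LOCK TABLE accounts IN ROW EXCLUSIVE MODE;  -- would now be accepted during recovery
  LOCK TABLE accounts IN SHARE MODE;          -- still above RowExclusiveLock, so still rejected
  ROLLBACK;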
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > - The assumption that b-tree vacuum records don't need conflict > resolution because we did that with the additional cleanup-info record > works ATM, but it hinges on the fact that we don't delete any tuples > marked as killed while we do the vacuum. That seems like a low-hanging > fruit that I'd actually like to do now that I spotted it, but will then > need to fix b-tree vacuum records accordingly. You didn't make a change, so I wonder whether you realised no change was required or that you still think change is necessary, but have left it to me? Not sure. I've investigated this but can't see any problem or need for change. btvacuumpage() is very specific about deleting *only* index tuples that have been collected during the VACUUM's heap scan, as identified by the heap callback function, lazy_tid_reaped(). There is no reliance at all on the state of the index tuple. If you ain't on the list, you ain't cleaned. I accept your observation that some additional index tuples may be marked as killed by backends accessing the table that is being vacuumed, since those backends could have a RecentGlobalXmin later than the OldestXmin used by the VACUUM as a result of the change that means GetSnapshotData() ignores lazy vacuums. But those tuples will not be identified by the callback function and so the "additionally killed" index tuples will not be removed. It is a possible future optimisation of b-tree vacuum that it cleans these additional killed tuples while it executes, but it doesn't do it now and so we need not worry about that for HS. I think its important that we note this assumption though. Comment? -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > >> - The assumption that b-tree vacuum records don't need conflict >> resolution because we did that with the additional cleanup-info record >> works ATM, but it hinges on the fact that we don't delete any tuples >> marked as killed while we do the vacuum. That seems like a low-hanging >> fruit that I'd actually like to do now that I spotted it, but will then >> need to fix b-tree vacuum records accordingly. > > You didn't make a change, so I wonder whether you realised no change was > required or that you still think change is necessary, but have left it > to me? Not sure. > > I've investigated this but can't see any problem or need for change. Sorry if I was unclear: it works as it is. But *if* we change the b-tree vacuum to also delete index tuples marked with LP_DEAD, we have a problem. > I think its important that we note this assumption though. Yeah, a comment is in order. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > - When switching from standby mode to normal operation, we momentarily > hold all AccessExclusiveLocks held by prepared xacts twice, needing > twice the lock space. You can run out of lock space at that point, > causing failover to fail. Proposed patch to fix that attached. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > >> - When switching from standby mode to normal operation, we momentarily >> hold all AccessExclusiveLocks held by prepared xacts twice, needing >> twice the lock space. You can run out of lock space at that point, >> causing failover to fail. > > Proposed patch to fix that attached. Doesn't eliminate the problem completely, but certainly makes it much less likely. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-11-15 at 16:07 +0200, Heikki Linnakangas wrote: > - When replaying b-tree deletions, we currently wait out/cancel all > running (read-only) transactions. We take the ultra-conservative > stance because we don't know how recent the tuples being deleted are. > If we could store a better estimate for latestRemovedXid in the WAL > record, we could make that less conservative. I think I can improve on the way we do this somewhat. When we GetConflictingVirtualXIDs() with InvalidTransactionId we include all backends. If a query can only see currently-running xacts then it cannot see any data that is being cleaned up because its xmin > latestCompletedXid. Put another way, Assert(latestRemovedXid <= latestCompletedXid) -- Simon Riggs www.2ndQuadrant.com