Thread: Hot Standby dev build (v8)
http://wiki.postgresql.org/wiki/Hot_Standby

v8 - resolves a dozen minor issues. Dev build with various debugging enabled. Expect clean v10 on Friday. All outstanding items from emails now listed on Wiki. Some new refactoring targets and tuning opportunities identified as knock-on effects from earlier refactoring. Testing started on this about 3 hrs ago, nothing new found as yet.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, 2009-01-14 at 23:55 +0000, Simon Riggs wrote:
> http://wiki.postgresql.org/wiki/Hot_Standby
>
> v8 - resolves a dozen minor issues.

Found and fixed another 8 issues, through code inspection, bash test and specific feature tests. Looking much better; still working - check Wiki for status.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
I don't think RecentGlobalXmin is good enough here:

> ! /*
> !  * We would like to set an accurate latestRemovedXid, but there
> !  * is no easy way of obtaining a useful value. So we use the
> !  * probably far too conservative value of RecentGlobalXmin instead.
> !  */
> ! xlrec_delete.latestRemovedXid = RecentGlobalXmin;
> ! rdata[0].data = (char *) &xlrec_delete;
> ! rdata[0].len = SizeOfBtreeDelete;

RecentGlobalXmin is just a hint, it lags behind the real oldest xmin that GetOldestXmin() would return. If another backend has a more recent RecentGlobalXmin value, and has killed more recent tuples on the page, the latestRemovedXid written here is too old.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-01-16 at 16:33 +0200, Heikki Linnakangas wrote:
> I don't think RecentGlobalXmin is good enough here:
>
> > ! /*
> > !  * We would like to set an accurate latestRemovedXid, but there
> > !  * is no easy way of obtaining a useful value. So we use the
> > !  * probably far too conservative value of RecentGlobalXmin instead.
> > !  */
> > ! xlrec_delete.latestRemovedXid = RecentGlobalXmin;
> > ! rdata[0].data = (char *) &xlrec_delete;
> > ! rdata[0].len = SizeOfBtreeDelete;
>
> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin
> that GetOldestXmin() would return. If another backend has a more recent
> RecentGlobalXmin value, and has killed more recent tuples on the page,
> the latestRemovedXid written here is too old.

What do you think we should do instead?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Fri, 2009-01-16 at 16:33 +0200, Heikki Linnakangas wrote:
>> I don't think RecentGlobalXmin is good enough here:
>>
>>> ! /*
>>> !  * We would like to set an accurate latestRemovedXid, but there
>>> !  * is no easy way of obtaining a useful value. So we use the
>>> !  * probably far too conservative value of RecentGlobalXmin instead.
>>> !  */
>>> ! xlrec_delete.latestRemovedXid = RecentGlobalXmin;
>>> ! rdata[0].data = (char *) &xlrec_delete;
>>> ! rdata[0].len = SizeOfBtreeDelete;
>> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin
>> that GetOldestXmin() would return. If another backend has a more recent
>> RecentGlobalXmin value, and has killed more recent tuples on the page,
>> the latestRemovedXid written here is too old.
>
> What do you think we should do instead?

Dunno. Maybe call GetOldestXmin().

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote:
> >> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin
> >> that GetOldestXmin() would return. If another backend has a more recent
> >> RecentGlobalXmin value, and has killed more recent tuples on the page,
> >> the latestRemovedXid written here is too old.
> >
> > What do you think we should do instead?
>
> Dunno. Maybe call GetOldestXmin().

We are discussing btree deletes, not btree vacuums.

If we are doing btree delete then we have an unreleased snapshot therefore we also have a non-zero xmin. How can another backend have a later RecentGlobalXmin or result from GetOldestXmin() than we do?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote:
>
>>>> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin
>>>> that GetOldestXmin() would return. If another backend has a more recent
>>>> RecentGlobalXmin value, and has killed more recent tuples on the page,
>>>> the latestRemovedXid written here is too old.
>>> What do you think we should do instead?
>> Dunno. Maybe call GetOldestXmin().
>
> We are discussing btree deletes, not btree vacuums.

Pardon my ignorance, but what's the difference?

> If we are doing
> btree delete then we have an unreleased snapshot therefore we also have
> a non-zero xmin. How can another backend have a later RecentGlobalXmin
> or result from GetOldestXmin() than we do?

Sure it can, for example:

1. Transaction 1 begins in backend A
2. Transaction 2 begins in backend B, xmin = 1
3. Transaction 1 ends
4. Transaction 3 begins in backend C, xmin = 2
5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
6. Transaction 2 ends.
7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, RecentGlobalXmin = 2
8. Transaction 4 kills tuple, using its RecentGlobalXmin of 1
9. Transaction 3 splits the page, emits a delete xlog record, setting latestRemovedXid to its RecentGlobalXmin of 2

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, 2009-01-19 at 09:16 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote:
> >
> >>>> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin
> >>>> that GetOldestXmin() would return. If another backend has a more recent
> >>>> RecentGlobalXmin value, and has killed more recent tuples on the page,
> >>>> the latestRemovedXid written here is too old.
> >>> What do you think we should do instead?
> >> Dunno. Maybe call GetOldestXmin().
> >
> > We are discussing btree deletes, not btree vacuums.
>
> Pardon my ignorance, but what's the difference?

In terms of current HEAD, not much. In terms of Hot Standby, a significant difference - the two actions have been split, rather than continuing to share the same WAL record.

XLOG_BTREE_VACUUM removes index tuples as a result of a vacuum. The initial scan of the heap already generated an XLOG_HEAP2_CLEANUP_INFO which gives the latestRemovedXid for that vacuum. So we don't need to worry about putting a latestRemovedXid on XLOG_BTREE_VACUUM. The WAL records also differ because the XLOG_BTREE_VACUUM contains details of blocks that need to be pinned but not otherwise touched.

XLOG_BTREE_DELETE is different in 3 ways. It isn't part of a vacuum, so:
* we don't need to take a cleanup lock
* it doesn't contain info about other blocks we need to scan beforehand for correctness purposes
* it wasn't preceded by an XLOG_HEAP2_CLEANUP_INFO record, so it must have a *correct* (even if too conservative) value for latestRemovedXid set.

So the only time we need to set latestRemovedXid correctly is during a normal transaction, not during a vacuum.

> > If we are doing
> > btree delete then we have an unreleased snapshot therefore we also have
> > a non-zero xmin. How can another backend have a later RecentGlobalXmin
> > or result from GetOldestXmin() than we do?
>
> Sure it can, for example:
>
> 1. Transaction 1 begins in backend A
> 2. Transaction 2 begins in backend B, xmin = 1
> 3. Transaction 1 ends
> 4. Transaction 3 begins in backend C, xmin = 2
> 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
> 6. Transaction 2 ends.
> 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2,
> RecentGlobalXmin = 2
> 8. Transaction 4 kills tuple, using its RecentGlobalXmin of 1
> 9. Transaction 3 splits the page, emits a delete xlog record, setting
> latestRemovedXid to its RecentGlobalXmin of 2

Well, steps 7 and 8 don't make sense.

Your earlier comment was that it was possible for a WAL record to be written with a RecentGlobalXmin that was lower than other backends' values. In step 9 the RecentGlobalXmin is *not* lower than any other backend, it is the same.

So if there is a proof, this isn't it.

But I can't see how there can be one: Two concurrent vacuums can have different OldestXmin values, but two concurrent transactions cannot.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
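To make the role of latestRemovedXid on an XLOG_BTREE_DELETE record concrete, here is a minimal sketch of how a standby might test a running query against it. This is an illustration only and not code from the patch: `ToyXid`, `toy_xid_precedes_or_equals` and `toy_delete_conflicts` are hypothetical names, and the modulo-2^32 comparison is an assumption modelled on the way PostgreSQL compares normal xids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;            /* stand-in for TransactionId */

#define TOY_FIRST_NORMAL_XID 3u     /* xids 0-2 are reserved in PostgreSQL */

/* Wraparound-aware "a <= b"; only meaningful for normal xids. */
bool
toy_xid_precedes_or_equals(ToyXid a, ToyXid b)
{
    assert(a >= TOY_FIRST_NORMAL_XID && b >= TOY_FIRST_NORMAL_XID);
    return (int32_t) (a - b) <= 0;
}

/*
 * Assumed rule: a standby query may still need the removed index tuples
 * if its snapshot xmin is not later than the record's latestRemovedXid.
 */
bool
toy_delete_conflicts(ToyXid latest_removed_xid, ToyXid snapshot_xmin)
{
    return toy_xid_precedes_or_equals(snapshot_xmin, latest_removed_xid);
}
```

Under this rule a latestRemovedXid that is too old makes the standby miss conflicts it should have detected, which is why the value has to be correct even if it is allowed to be conservative.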
Simon Riggs wrote:
> Well, steps 7 and 8 don't make sense.
>
> Your earlier comment was that it was possible for a WAL record to be
> written with a RecentGlobalXmin that was lower than other backends
> values. In step 9 the RecentGlobalXmin is *not* lower than any other
> backend, it is the same.
>
> So if there is a proof, this isn't it.

Yeah, you're right. I got steps 8 and 9 mixed. Let me try again:

1. Transaction 1 begins in backend A
2. Transaction 2 begins in backend B, xmin = 1
3. Transaction 1 ends
4. Transaction 3 begins in backend C, xmin = 2
5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
6. Transaction 2 ends.
7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, RecentGlobalXmin = 2
8. Transaction 3 kills tuple, using its RecentGlobalXmin of 2
9. Transaction 4 splits the page, emits a delete xlog record, setting latestRemovedXid to its RecentGlobalXmin of 1

> But I can't see how there can be one: Two concurrent vacuums can have
> different OldestXmin values, but two concurrent transactions cannot.

Of course they can.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, 2009-01-19 at 12:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Well, steps 7 and 8 don't make sense.
> >
> > Your earlier comment was that it was possible for a WAL record to be
> > written with a RecentGlobalXmin that was lower than other backends
> > values. In step 9 the RecentGlobalXmin is *not* lower than any other
> > backend, it is the same.
> >
> > So if there is a proof, this isn't it.
>
> Yeah, you're right. I got steps 8 and 9 mixed. Let me try again:
>
> 1. Transaction 1 begins in backend A
> 2. Transaction 2 begins in backend B, xmin = 1
> 3. Transaction 1 ends
> 4. Transaction 3 begins in backend C, xmin = 2
> 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
> 6. Transaction 2 ends.
> 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2,
> RecentGlobalXmin = 2
> 8. Transaction 3 kills tuple, using its RecentGlobalXmin of 2
> 9. Transaction 4 splits the page, emits a delete xlog record, setting
> latestRemovedXid to its RecentGlobalXmin of 1

One of us needs a coffee.

How does Transaction 4 have a RecentGlobalXmin of 2 in step (7), but at step (9) the value of RecentGlobalXmin has gone backwards?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> One of us needs a coffee.

Clearly, I put the kettle on...

> How does Transaction 4 have a RecentGlobalXmin of 2 in step (7), but at
> step (9) the value of RecentGlobalXmin has gone backwards?

Looks like I mixed up the xids of the two transactions in steps 8 and 9. Let's see if I got it right this time:

1. Transaction 1 begins in backend A
2. Transaction 2 begins in backend B, xmin = 1
3. Transaction 1 ends
4. Transaction 3 begins in backend C, xmin = 2
5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
6. Transaction 2 ends.
7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, RecentGlobalXmin = 2
8. Transaction 4 kills tuple, using its RecentGlobalXmin of 2
9. Transaction 3 splits the page, emits a delete xlog record, setting latestRemovedXid to its RecentGlobalXmin of 1

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, 2009-01-19 at 12:50 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > One of us needs a coffee.
>
> Clearly, I put the kettle on...

I had one too, just in case.

> > How does Transaction 4 have a RecentGlobalXmin of 2 in step (7), but at
> > step (9) the value of RecentGlobalXmin has gone backwards?
>
> Looks like I mixed up the xids of the two transactions in steps 8 and 9.
> Let's see if I got it right this time:
>
> 1. Transaction 1 begins in backend A
> 2. Transaction 2 begins in backend B, xmin = 1
> 3. Transaction 1 ends
> 4. Transaction 3 begins in backend C, xmin = 2
> 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
> 6. Transaction 2 ends.
> 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2,
> RecentGlobalXmin = 2
> 8. Transaction 4 kills tuple, using its RecentGlobalXmin of 2
> 9. Transaction 3 splits the page, emits a delete xlog record, setting
> latestRemovedXid to its RecentGlobalXmin of 1

I don't see how step (5) is possible. GetSnapshotData() sets RecentGlobalXmin to the result of the snapshot's xmin.

If step (5) is possible, then yes, step (9) can happen.

You are correct to say that RecentGlobalXmin is not always correctly set. All I'm saying is that at the exact time, place and circumstance I use it, it is correct. In other cases, it may not be.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> GetSnapshotData() sets
> RecentGlobalXmin to the result of the snapshot's xmin.

No. RecentGlobalXmin is set to the oldest *xmin* observed, across all running transactions. TransactionXmin is the xid of the oldest running transaction. IOW, RecentGlobalXmin is the xid of transaction that the oldest running transaction sees as running.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
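The distinction can be replayed with a toy model. The following is a minimal sketch, not PostgreSQL's GetSnapshotData(): `ToyProc`, `toy_begin`, `toy_end` and `toy_get_snapshot` are hypothetical names, xid wraparound and locking are ignored, and every backend is assumed to advertise the xmin of its latest snapshot until its transaction ends.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_BACKENDS 4
#define TOY_INVALID_XID  0u

typedef struct ToyProc
{
    uint32_t xid;   /* running transaction, or TOY_INVALID_XID */
    uint32_t xmin;  /* xmin advertised by its snapshot, or TOY_INVALID_XID */
} ToyProc;

typedef struct ToySnapshot
{
    uint32_t transaction_xmin;
    uint32_t recent_global_xmin;
} ToySnapshot;

ToyProc toy_procs[TOY_MAX_BACKENDS];

void
toy_begin(int backend, uint32_t xid)
{
    assert(backend >= 0 && backend < TOY_MAX_BACKENDS);
    toy_procs[backend].xid = xid;
    toy_procs[backend].xmin = TOY_INVALID_XID;
}

void
toy_end(int backend)
{
    assert(backend >= 0 && backend < TOY_MAX_BACKENDS);
    toy_procs[backend].xid = TOY_INVALID_XID;
    toy_procs[backend].xmin = TOY_INVALID_XID;
}

ToySnapshot
toy_get_snapshot(int me)
{
    ToySnapshot snap;
    uint32_t    xmin;
    uint32_t    global_xmin;
    int         i;

    assert(me >= 0 && me < TOY_MAX_BACKENDS);
    assert(toy_procs[me].xid != TOY_INVALID_XID);

    /* Snapshot xmin: the oldest xid still running, our own included. */
    xmin = toy_procs[me].xid;
    for (i = 0; i < TOY_MAX_BACKENDS; i++)
        if (toy_procs[i].xid != TOY_INVALID_XID && toy_procs[i].xid < xmin)
            xmin = toy_procs[i].xid;

    /* Global xmin: the oldest xmin advertised by anyone, ours included. */
    global_xmin = xmin;
    for (i = 0; i < TOY_MAX_BACKENDS; i++)
        if (toy_procs[i].xmin != TOY_INVALID_XID && toy_procs[i].xmin < global_xmin)
            global_xmin = toy_procs[i].xmin;

    toy_procs[me].xmin = xmin;
    snap.transaction_xmin = xmin;
    snap.recent_global_xmin = global_xmin;
    return snap;
}
```

Replaying steps 1-7 of the scenario through this model (backend A = 0, B = 1, C = 2), backend C's snapshot ends up with a global xmin of 1 because backend B is still advertising xmin = 1, while backend A's later snapshot sees 2. The per-transaction xmin the toy computes need not match the hand-written steps in every detail; the point is only that two concurrent transactions can hold different RecentGlobalXmin values.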
On Mon, 2009-01-19 at 14:00 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > GetSnapshotData() sets
> > RecentGlobalXmin to the result of the snapshot's xmin.
>
> No. RecentGlobalXmin is set to the oldest *xmin* observed, across all
> running transactions. TransactionXmin is the xid of the oldest running
> transaction. IOW, RecentGlobalXmin is the xid of transaction that the
> oldest running transaction sees as running.

OK. That was fun.

These WAL records are annoying, no matter what the exact value of latestRemovedXid is, and they seem likely to conflict with many queries on the standby. If we don't use RecentGlobalXmin then I can't see any easily derived value that we can use in its place. It isn't worth the effort on the master to derive a more exact value, not when we don't even know if it matters yet.

I suggest we handle this on the recovery side, not on the master, by deriving the xmin at the point the WAL record arrives. We would calculate it by looking at recovery procs only. That will likely give us a later value than we would get from the master, but that can't be helped.

For me, this makes it essential now that I put in place the deferred cancellation mechanism. Some refactoring in this area is also required because we need to handle two other types of conflict to correctly support drop database and drop user, which is now underway.

Btree deletes were an important optimisation when it first went in, but now we have HOT it is much less important. Another route might be to put an option to turn off btree delete on the master, default = on. We probably should consider turning it off entirely when it doesn't yield significant benefit. Lots of scanning to remove the odd row is probably pretty wasteful and likely adds contention at the very point we don't want it - index splits.

Thoughts?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
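A rough sketch of the recovery-side idea, for illustration only: the standby scans the xids it is tracking in its recovery procs and takes the oldest one as the xmin at the moment the delete record arrives. The names `toy_xid_precedes` and `toy_recovery_xmin` are hypothetical, the fallback to a `next_xid` argument when no transactions are tracked is an assumption, and the thread does not say how the result would be turned into a latestRemovedXid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-aware "a < b" for normal xids (assumed comparison rule). */
int
toy_xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Oldest xid among the transactions the standby still believes are
 * running on the master. With nothing tracked, fall back to next_xid,
 * the next xid the standby expects to see (hypothetical parameter).
 */
uint32_t
toy_recovery_xmin(const uint32_t *recovery_xids, size_t n, uint32_t next_xid)
{
    uint32_t oldest = next_xid;
    size_t   i;

    assert(recovery_xids != NULL || n == 0);

    for (i = 0; i < n; i++)
        if (toy_xid_precedes(recovery_xids[i], oldest))
            oldest = recovery_xids[i];

    return oldest;
}
```

Because the standby only learns that a transaction has ended when its commit or abort record is replayed, a value derived this way depends on those records arriving for every tracked transaction.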
Simon Riggs wrote:
> I suggest we handle this on the recovery side, not on the master, by
> deriving the xmin at the point the WAL record arrives. We would
> calculate it by looking at recovery procs only. That will likely give us
> a later value than we would get from the master, but that can't be
> helped.

Hmm, that's an interesting idea. It presumes that we see an abort/commit WAL record at the right moment for every transaction that we have a recovery proc for. We just concluded in the other thread that we do always emit abortion records when the database is running normally; I think that's good enough for this purpose.

A few other random ideas I had:

- in btree delete redo, follow the index pointers, and look at the xids on the heap tuples. That requires some random I/O, but will give the exact value we need. Since it's quite expensive, I think we'd only want to do it after using some more conservative but quicker test to determine that there might be a conflict.

- Add latestRemovedXid to b-tree page header, and update it as tuples are killed. Need to tolerate the fact that tuple kills are not WAL-logged.

> Btree deletes were an important optimisation when it first went in, but
> now we have HOT it is much less important.

If HOT is working well for your application, there won't be many btree deletes anyway, and the whole issue is moot.

> Another route might be to put
> an option to turn off btree delete on the master, default = on. We
> probably should consider turning it off entirely when it doesn't yield
> significant benefit.

I'd rather put in a generic mechanism to prevent vacuuming of recent tuples that might still be needed in the standby. Like always subtracting a fixed amount of xids from OldestXmin/RecentGlobalXmin, or having a feedback loop from the standby to the master, allowing the master to say what its oldest xmin is. But that's a fair amount of work; I'd rather leave that as a future enhancement, and just figure out something simple for this specific issue. We'll need to handle it gracefully even if we try to avoid it by retaining dead tuples longer.

> Lots of scanning to remove the odd row is probably
> pretty wasteful and likely adds contention at the very point we don't
> want it - index splits.

Remember that if you can remove enough dead tuples from the index page, you've just made room on the page and don't need to split. Splitting is pretty expensive compared to scanning a few line pointers.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
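The "subtract a fixed amount of xids" variant could look roughly like this. It is a sketch under stated assumptions, not a proposed patch: `toy_defer_horizon` and its `defer_xids` parameter are hypothetical, and the clamp to the first normal xid is one simple way to avoid landing on PostgreSQL's reserved xids 0-2.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_FIRST_NORMAL_XID 3u    /* xids 0-2 are reserved in PostgreSQL */

/*
 * Push the cleanup horizon back by defer_xids transactions so that
 * tuples the standby might still need are retained a little longer.
 */
uint32_t
toy_defer_horizon(uint32_t oldest_xmin, uint32_t defer_xids)
{
    uint32_t result;

    assert(oldest_xmin >= TOY_FIRST_NORMAL_XID);

    /* Unsigned subtraction wraps modulo 2^32, like the xid space itself. */
    result = oldest_xmin - defer_xids;

    /* Never return one of the reserved, non-normal xids. */
    if (result < TOY_FIRST_NORMAL_XID)
        result = TOY_FIRST_NORMAL_XID;

    return result;
}
```

A fixed offset only reduces the chance of a conflict; a standby query older than the offset can still need a removed tuple, so conflicts have to be handled on the standby regardless.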
On Mon, 2009-01-19 at 15:47 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I suggest we handle this on the recovery side, not on the master, by
> > deriving the xmin at the point the WAL record arrives. We would
> > calculate it by looking at recovery procs only. That will likely give us
> > a later value than we would get from the master, but that can't be
> > helped.
>
> Hmm, that's an interesting idea. It presumes that we see an abort/commit
> WAL record at the right moment for every transaction that we have a
> recovery proc for. We just concluded in the other thread that we do
> always emit abortion records when the database is running normally; I
> think that's good enough for this purpose.

But not perfect.

> A few other random ideas I had:
>
> - in btree delete redo, follow the index pointers, and look at the xids
> on the heap tuples. That requires some random I/O, but will give the
> exact value we need. Since it's quite expensive, I think we'd only want
> to do it after using some more conservative but quicker test to
> determine that there might be a conflict.

Ouch.

> - Add latestRemovedXid to b-tree page header, and update it as tuples
> are killed. Need to tolerate the fact that tuple kills are not WAL-logged.

Sounds easy-ish. If tuple kills aren't WAL logged then if we crash latestRemovedXid will remain as it was at time of last write. So if we do a delete scan it will only remove the index tuples with hint bits set at time of that write, so the value would always be correct, no?

I'm somehow uncomfortable with this idea though. Care to persuade me further?

> > Btree deletes were an important optimisation when it first went in, but
> > now we have HOT it is much less important.
>
> If HOT is working well for your application, there won't be many btree
> deletes anyway, and the whole issue is moot.

That was my point.

> > Another route might be to put
> > an option to turn off btree delete on the master, default = on. We
> > probably should consider turning it off entirely when it doesn't yield
> > significant benefit.
>
> I'd rather put in a generic mechanism to prevent vacuuming of recent
> tuples that might still be needed in the standby. Like always
> subtracting a fixed amount of xids from OldestXmin/RecentGlobalXmin, or
> having a feedback loop from the standby to the master, allowing the
> master to say what its oldest xmin is. But that's a fair amount of
> work; I'd rather leave that as a future enhancement, and just figure out
> something simple for this specific issue. We'll need to handle it
> gracefully even if we try to avoid it by retaining dead tuples longer.

Yeh, looked at both of those also. Definitely after sync rep goes in though.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Mon, 2009-01-19 at 12:54 +0000, Simon Riggs wrote:
> Some refactoring in this area is also required
> because we need to handle two other types of conflict to correctly
> support drop database and drop user, which is now underway.

I've hung the drop database conflict code in dbase_redo().

For drop role, there isn't an rmgr at all, but I can add code in a few places.

* add XLOG_DBASE_DROP_USER - i.e. add drop user to the Database rmgr
* DropRole() takes an AccessExclusiveLock, so we do write a WAL record for it. I could add a special case to the Relation rmgr.
* Add a new rmgr (unused slot 7) and have it handle DropRole.

I prefer the last one, but if you think otherwise, please shout.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> I prefer the last one, but if you think otherwise, please shout.

We're now emitting WAL records for relcache invalidations, IIRC. I wonder if those are useful for this?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, 2009-01-19 at 16:50 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I prefer the last one, but if you think otherwise, please shout.
>
> We're now emitting WAL records for relcache invalidations, IIRC. I
> wonder if those are useful for this?

Tom already objected to putting strange inval messages into WAL.

Hmm, DROP USER is transactional, so we can only do this at commit. So forget the other ideas I had.

We already know about the auth file update at commit. So we should say, at commit, re-read the list of roleids in use and if any don't match a row in pg_user then remove them. If we do that after the flat file update and the actual commit that removes the user then we will be guaranteed no race condition exists to allow new users to log on as we try to disconnect them.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
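The commit-time check described above amounts to a set difference between the role ids held by connected sessions and the role ids that still exist. A minimal sketch, assuming both lists are already available as plain arrays; `toy_role_exists` and `toy_find_orphaned_sessions` are hypothetical names and nothing here reflects how the standby would actually obtain either list or disconnect a session.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if roleid appears in the list of roles that still exist. */
bool
toy_role_exists(const uint32_t *roles, size_t nroles, uint32_t roleid)
{
    size_t i;

    for (i = 0; i < nroles; i++)
        if (roles[i] == roleid)
            return true;
    return false;
}

/*
 * Flag every session whose role has been dropped. disconnect[] must have
 * room for nsessions entries. Returns the number of sessions flagged.
 */
size_t
toy_find_orphaned_sessions(const uint32_t *session_roles, size_t nsessions,
                           const uint32_t *roles, size_t nroles,
                           bool *disconnect)
{
    size_t i;
    size_t count = 0;

    assert(disconnect != NULL || nsessions == 0);

    for (i = 0; i < nsessions; i++)
    {
        disconnect[i] = !toy_role_exists(roles, nroles, session_roles[i]);
        if (disconnect[i])
            count++;
    }
    return count;
}
```

Running the check only after the commit that removes the role has been applied matches the ordering argument in the message: a dropped role can no longer be used to log on by the time its sessions are flagged.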