Thread: Block level concurrency during recovery
I'm looking at how to make queries safe during recovery, in the presence of concurrent changes to blocks. In particular, concurrent removal of rows that might be read by queries.

My thinking is:

* we ignore LockRelationForExtension(). Normal queries never request it. All new blocks were created with that lock held and we are replaying changes serially, so we do not need to re-create that lock. We already do this, so no change.

* re-create the Cleanup lock on blocks, when the original operation was performed while a Cleanup lock was held.

So the plan is to introduce a new XLogLockBufferForCleanup() function and then use it in all places where a cleanup lock was held during the write of the WAL record.

This means we will need to hold a cleanup lock:

* while RM_HEAP2_ID records are applied (Freeze, Clean, CleanMove)

* while an XLOG_BTREE_DELETE that was generated by VACUUM is applied - this record type is not always generated by VACUUM. So split this WAL record into two types, XLOG_BTREE_DELETE and XLOG_BTREE_VACUUM, so we can hold a Cleanup lock while applying XLOG_BTREE_VACUUM. (This may not be required, but I'd rather do the full locking now and relax it later.)

* whenever we apply a backup block that performs the same function as any of the above WAL records. So we would hold a Cleanup lock when applying WAL records of these types:
  all RM_HEAP2_ID types
  XLOG_BTREE_VACUUM

I'm most of the way through implementing the above and will post the patch as a separate item to make it easier to inspect.

Other aspects:

* For GIN indexes, we appear not to hold a Cleanup lock during vacuuming, except on the root page. That stops new scans from starting, but it doesn't prevent progress of concurrent scans. Doesn't look correct to me... so I'm not sure what strength lock to acquire in each case. We probably need to differentiate between WAL record types so we can tell which to acquire the Cleanup lock for and which not.

* GIST doesn't use Cleanup locks at all. So I'm very unclear here.

Teodor has mentioned that it should be OK for GIST/GIN. Can I double check that based upon my inspection of the code?

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
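A minimal sketch of what XLogLockBufferForCleanup() could look like, assuming the XLogReadBuffer(rel, blkno, init) interface that returns the page pinned and exclusively locked; the body here is illustrative only, not the posted patch:

    #include "postgres.h"
    #include "access/xlogutils.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /*
     * Sketch only: read a block during redo and hold a cleanup lock on it,
     * mirroring the lock strength the original writer held.  Assumes
     * XLogReadBuffer() returns the buffer pinned and exclusively locked.
     */
    Buffer
    XLogLockBufferForCleanup(Relation reln, BlockNumber blkno)
    {
        Buffer      buf = XLogReadBuffer(reln, blkno, false);

        if (BufferIsValid(buf))
        {
            /* trade the exclusive lock for a cleanup lock (sole pin holder) */
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
            LockBufferForCleanup(buf);
        }
        return buf;
    }

Redo routines for the record types listed above (the RM_HEAP2_ID records and the proposed XLOG_BTREE_VACUUM) would call this in place of XLogReadBuffer(), so replay waits until no standby backend still holds a pin on the page.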
Simon Riggs <simon@2ndQuadrant.com> writes:

> * re-create the Cleanup lock on blocks, when the original operation was
> performed while a Cleanup lock was held.
>
> So the plan is to introduce a new XLogLockBufferForCleanup() function
> and then use it in all places where a cleanup lock was held during the
> write of the WAL record.

I'm not sure I'm following what you're talking about. Are you saying the slave will have to stop when it reaches a cleanup log entry and wait until it can obtain the exclusive lock with no other pins?

That sounds like the right thing, though it could cause a long pause in WAL recovery. Perhaps we should plan on being able to kill other processes holding pins on the buffer, just as we discussed killing transactions holding up advancing the globalxmin.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
On Mon, 2008-10-20 at 14:23 +0100, Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>
> > * re-create the Cleanup lock on blocks, when the original operation was
> > performed while a Cleanup lock was held.
> >
> > So the plan is to introduce a new XLogLockBufferForCleanup() function
> > and then use it in all places where a cleanup lock was held during the
> > write of the WAL record.
>
> I'm not sure I'm following what you're talking about. Are you saying the slave
> will have to stop when it reaches a cleanup log entry and wait until it can
> obtain the exclusive lock with no other pins?

Yes.

> That sounds like the right thing, though it could cause a long pause in WAL
> recovery.

It could.

> Perhaps we should plan on being able to kill other processes holding
> pins on the buffer, just as we discussed killing transactions holding up
> advancing the globalxmin.

I'll copy my notes straight out of the Wiki:

We can't cancel a query that holds a pin on a particular block, since we don't keep track of who holds a pin. We just know that a backend has a pin and that the startup process must wait. This may become a problem, or it may not. In most cases, backends hold 5-15 pins concurrently and pins are held for relatively short times. The startup process will provide debug information for when pin hold times are extended, and for monitoring total delay from pin waits.

Some strategies, if this becomes a problem:
* minimise pin hold times wherever this occurs
* defer application of WAL for pinned blocks (requiring us to use conditional locking)
* make VACUUM FREEZE hold stronger locks on the standby to prevent query access (but that doesn't help HOT)
* ensure that we perform read-ahead I/O for WAL records that have not yet reached the head of the apply queue

Most people on-list were comfortable with the idea of recovery waiting for queries to finish, up to a limit (re: the discussion around max_standby_delay). So waiting for cleanup locks is just another source of potential wait time. I think it will take a while for people to come up with some user experience of whether this will be a pain or not.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
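For the "defer application of WAL for pinned blocks" strategy above, the conditional-locking building block might look like the following sketch; it uses only the existing bufmgr calls, and the deferral itself is left as a comment since no such queue exists yet:

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Sketch only: caller holds a pin (but no content lock) on 'buffer'.
     * Try to get the cleanup lock without blocking; if another backend
     * still holds a pin, a deferral strategy would park this record and
     * move on (and would also have to defer any later records touching
     * the same block, to preserve apply order).  Absent that, fall back
     * to waiting, which is what the startup process would do anyway.
     */
    static void
    LockBufferForCleanupInRedo(Buffer buffer)
    {
        /* fast path: nobody else has the page pinned */
        if (ConditionalLockBufferForCleanup(buffer))
            return;

        /* deferral would go here; for now, just wait for the pin to go away */
        LockBufferForCleanup(buffer);
    }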
> * For GIN indexes, we appear not to hold a Cleanup lock during
> vacuuming, except on the root page. That stops new scans from starting, but
> it doesn't prevent progress of concurrent scans. Doesn't look correct to
> me... so I'm not sure what strength lock to acquire in each case. We probably

Why do you think so?

> need to differentiate between WAL record types so we can tell which to
> acquire the Cleanup lock for and which not.
>
> * GIST doesn't use Cleanup locks at all. So I'm very unclear here.

The scan process does not hold any pointer to a page now; the insertion process does, but it tracks changes to the page by looking at the page's LSN.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, 2008-10-20 at 18:28 +0400, Teodor Sigaev wrote:
> > * For GIN indexes, we appear not to hold a Cleanup lock during
> > vacuuming, except on the root page. That stops new scans from starting, but
> > it doesn't prevent progress of concurrent scans. Doesn't look correct to
> > me... so I'm not sure what strength lock to acquire in each case. We probably
>
> Why do you think so?

I have questions. I don't understand why in ginVacuumPostingTreeLeaves() we lock only the root page for Cleanup and no others. Why do we need to hold a Cleanup lock on the root? If the index is concurrent-safe for existing scans, why is it not safe for new scans also? And the converse: if it is not safe for new scans, why is it safe for existing scans?

> > need to differentiate between WAL record types so we can tell which to
> > acquire the Cleanup lock for and which not.
> >
> > * GIST doesn't use Cleanup locks at all. So I'm very unclear here.
>
> The scan process does not hold any pointer to a page now; the insertion
> process does, but it tracks changes to the page by looking at the page's LSN.

Sounds good.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
> I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
> root page for Cleanup and no others. Why do we need to hold a Cleanup lock
> on the root? If the index is concurrent-safe for existing scans, why is
> it not safe for new scans also? And the converse: if it is not safe for
> new scans, why is it safe for existing scans?

Because we wish to prevent concurrent inserts and page deletion, just to simplify the code. LockBufferForCleanup on the root guarantees that no insertion process is at work in that tree (an inserter keeps the root buffer pinned for the whole insertion). As a side effect, new scan processes can't start either.

Note that in most cases this still allows enough concurrency, because all of this concerns work on one tree in the GIN index. Usually there are a lot of such trees in an index - one for each lexeme, if we are speaking about a tsearch index. So there is room for improvement, but I don't believe it would give a big performance advantage in typical usage of GIN.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, 2008-10-20 at 20:12 +0400, Teodor Sigaev wrote:
> > I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
> > root page for Cleanup and no others. Why do we need to hold a Cleanup lock
> > on the root? If the index is concurrent-safe for existing scans, why is
> > it not safe for new scans also? And the converse: if it is not safe for
> > new scans, why is it safe for existing scans?
>
> Because we wish to prevent concurrent inserts and page deletion, just to
> simplify the code. LockBufferForCleanup on the root guarantees that no
> insertion process is at work in that tree (an inserter keeps the root buffer
> pinned for the whole insertion). As a side effect, new scan processes can't
> start either.

But does holding a cleanup lock on the root prevent an in-progress Insert from changing non-root pages? I assume so, just not sure how.

> Note that in most cases this still allows enough concurrency, because all of
> this concerns work on one tree in the GIN index. Usually there are a lot of
> such trees in an index - one for each lexeme, if we are speaking about a
> tsearch index. So there is room for improvement, but I don't believe it would
> give a big performance advantage in typical usage of GIN.

I'm just worried about safety during Hot Standby, not trying to improve anything.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
> But does holding a cleanup lock on the root prevent an in-progress Insert from
> changing non-root pages? I assume so, just not sure how.

Yes, because the insertion process doesn't unpin the root page until the insert is done.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, 2008-10-20 at 13:54 +0100, Simon Riggs wrote:
> I'm looking at how to make queries safe during recovery, in the presence
> of concurrent changes to blocks. In particular, concurrent removal of
> rows that might be read by queries.
>
> My thinking is:
> * we ignore LockRelationForExtension(). Normal queries never request it.
> All new blocks were created with that lock held and we are replaying
> changes serially, so we do not need to re-create that lock. We already
> do this, so no change.
> * re-create the Cleanup lock on blocks, when the original operation was
> performed while a Cleanup lock was held.
>
> So the plan is to introduce a new XLogLockBufferForCleanup() function
> and then use it in all places where a cleanup lock was held during the
> write of the WAL record.
>
> This means we will need to hold a cleanup lock:
>
> * while RM_HEAP2_ID records are applied (Freeze, Clean, CleanMove)
>
> * while an XLOG_BTREE_DELETE that was generated by VACUUM is applied - this
> record type is not always generated by VACUUM. So split this WAL record into
> two types, XLOG_BTREE_DELETE and XLOG_BTREE_VACUUM, so we can hold a Cleanup
> lock while applying XLOG_BTREE_VACUUM. (This may not be required, but
> I'd rather do the full locking now and relax it later.)
>
> * whenever we apply a backup block that performs the same function as
> any of the above WAL records. So we would hold a Cleanup lock when
> applying WAL records of these types:
>   all RM_HEAP2_ID types
>   XLOG_BTREE_VACUUM
>
> I'm most of the way through implementing the above and will post the patch
> as a separate item to make it easier to inspect.

Here's the code to implement this portion.

I'm undecided as to whether this is too much code or too little. I'm somewhat uncertain as to the locking requirements for the physical scan during a vacuum. I've worked out various options if we need to change this.

Also, the way I take a cleanup lock during RestoreBkpBlocks() seems very ugly, but too much additional code seems over the top also.

I'm integrating into the rest of the patch now.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
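As a sketch of the decision being made for backup blocks, something along these lines would tell RestoreBkpBlocks() whether to take a cleanup lock; the helper name is an assumption, and the record-type checks simply mirror the list upthread rather than the attached patch (XLOG_BTREE_VACUUM being the proposed new record type):

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "access/rmgr.h"
    #include "access/xlog.h"

    /*
     * Sketch only: should the backup blocks attached to this record be
     * restored under a cleanup lock rather than a plain exclusive lock?
     */
    static bool
    BkpBlocksNeedCleanupLock(XLogRecord *record)
    {
        uint8       info = record->xl_info & ~XLR_INFO_MASK;

        if (record->xl_rmid == RM_HEAP2_ID)
            return true;        /* Freeze, Clean, CleanMove */

        if (record->xl_rmid == RM_BTREE_ID && info == XLOG_BTREE_VACUUM)
            return true;        /* proposed VACUUM-only variant of DELETE */

        return false;
    }

RestoreBkpBlocks() could then take the cleanup lock for such records in the same way as the earlier XLogLockBufferForCleanup() sketch: drop the exclusive lock returned for the restored block and call LockBufferForCleanup() while still holding the pin.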
Simon Riggs wrote:
> I'm undecided as to whether this is too much code or too little. I'm
> somewhat uncertain as to the locking requirements for the physical scan
> during a vacuum. I've worked out various options if we need to change
> this.

For the B-tree, an exclusive lock is enough to modify the contents of the page. A cleanup lock needs to be taken on every page to ensure that the vacuum doesn't finish and delete a heap tuple that's about to be accessed by an index scan. That could be handled in WAL replay by dropping the exclusive lock and re-acquiring the cleanup lock, but we can't do that for heap vacuums.

However, we require that in b-tree vacuum, you take a cleanup lock on *every* leaf page of the index, not only those that you modify. That's a problem, because there's no trace of such pages in the WAL.

PS. FSM doesn't need cleanup locks.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2008-10-23 at 09:09 +0300, Heikki Linnakangas wrote:

> However, we require that in b-tree vacuum, you take a cleanup lock on
> *every* leaf page of the index, not only those that you modify. That's a
> problem, because there's no trace of such pages in the WAL.

OK, good. Thanks for the second opinion. I'm glad you said that, cos I felt sure anybody reading the patch would say "what the hell does this bit do?". Now I can add it.

My solution is fairly simple: as we pass through the index we keep track of which blocks need visiting, then append that information onto the next WAL record. If the last block doesn't contain removed rows, then we send a no-op message saying which blocks to visit.

I'd already invented the XLOG_BTREE_VACUUM record, so now we just need to augment it further with two fields: an ordered array of blocks to visit, and a doit flag.

Say we have a 10-block index, with rows to be removed on blocks 3, 4 and 8. As we visit all 10 in sequence we would issue these WAL records:

XLOG_BTREE_VACUUM block 3  visitFirst {1, 2}    doit = true
XLOG_BTREE_VACUUM block 4  visitFirst {}        doit = true
XLOG_BTREE_VACUUM block 8  visitFirst {5, 6, 7} doit = true
XLOG_BTREE_VACUUM block 10 visitFirst {9}       doit = false

So that allows us to issue the same number of WAL messages yet include all the required information to repeat the process correctly. (The blocks can be visited out of sequence in some cases, hence the ordered array of blocks to visit rather than just a first-block value.)

It would also be possible to introduce a special tweak there, which is that if the block is not in cache, don't read it in at all. If it's not in cache we know that nobody has a pin on it, so we don't need to read it in just to say "got the lock". That's icing for later.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
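A sketch of what the augmented record could look like; the field names follow the example records above (visitFirst, doit) and are illustrative only, not the eventual patch:

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/relfilenode.h"

    /*
     * Sketch only: augmented XLOG_BTREE_VACUUM record.  Redo would first
     * take and release a cleanup lock on each block in visitFirst[], in
     * order, and then clean 'block' itself under a cleanup lock (skipping
     * the cleaning step when doit is false).
     */
    typedef struct xl_btree_vacuum
    {
        RelFileNode node;           /* the index being vacuumed */
        BlockNumber block;          /* leaf block this record cleans */
        bool        doit;           /* false => nothing removed on 'block';
                                     * record only carries visitFirst[] */
        uint16      nVisitFirst;    /* number of blocks to cleanup-lock
                                     * before touching 'block' */
        /* nVisitFirst BlockNumbers (visitFirst[]) follow, then the
         * deletable item offsets for 'block' */
    } xl_btree_vacuum;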
Simon Riggs wrote:
> It would also be possible to introduce a special tweak there, which is
> that if the block is not in cache, don't read it in at all. If it's not
> in cache we know that nobody has a pin on it, so we don't need to read it
> in just to say "got the lock". That's icing for later.

Heh, that's clever :-). We could actually use a method like that to solve the whole problem. After the (replay of the) b-tree vacuum is finished, scan through all shared buffers, and get a vacuum lock on every page of that index that's in cache. Groveling through all shared buffers would be slower for small indexes, though.

I believe the "vacuum lock every leaf page" behavior is only needed for system catalogs. You have other issues with those still, namely that table locks are not yet taken appropriately, so I'm not sure if this is worth worrying about until that's done.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > It would also be possible to introduce a special tweak there, which is
> > that if the block is not in cache, don't read it in at all. If it's not
> > in cache we know that nobody has a pin on it, so we don't need to read it
> > in just to say "got the lock". That's icing for later.
>
> Heh, that's clever :-). We could actually use a method like that to
> solve the whole problem. After the (replay of the) b-tree vacuum is
> finished, scan through all shared buffers, and get a vacuum lock on
> every page of that index that's in cache. Groveling through all shared
> buffers would be slower for small indexes, though.

Well, re-examining all of the assumptions in the code seems to have been worthwhile so far. I think that makes 4 significant tweaks that have come out of the Search For Hot Standby.

> I believe the "vacuum lock every leaf page" behavior is only needed for
> system catalogs.

So we will still need it then. Which is good 'cos I just wrote it.

> You have other issues with those still, namely that
> table locks are not yet taken appropriately, so I'm not sure if this is
> worth worrying about until that's done.

Please explain. If you know of a correctness issue, please say.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
> > You have other issues with those still, namely that
> > table locks are not yet taken appropriately, so I'm not sure if this is
> > worth worrying about until that's done.
>
> Please explain. If you know of a correctness issue, please say.

Things like DROP TABLE need to take an AccessExclusiveLock, and you don't have that yet, no?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2008-10-23 at 20:24 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
> > > You have other issues with those still, namely that
> > > table locks are not yet taken appropriately, so I'm not sure if this is
> > > worth worrying about until that's done.
> >
> > Please explain. If you know of a correctness issue, please say.
>
> Things like DROP TABLE need to take an AccessExclusiveLock, and you
> don't have that yet, no?

Oh right, I thought you meant something not already in the plan.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-10-23 at 09:57 +0100, Simon Riggs wrote:
> On Thu, 2008-10-23 at 09:09 +0300, Heikki Linnakangas wrote:
>
> > However, we require that in b-tree vacuum, you take a cleanup lock on
> > *every* leaf page of the index, not only those that you modify. That's a
> > problem, because there's no trace of such pages in the WAL.
>
> OK, good. Thanks for the second opinion. I'm glad you said that, cos I
> felt sure anybody reading the patch would say "what the hell does this
> bit do?". Now I can add it.

Heikki,

When we discussed this before, I was glad that you'd mentioned that aspect since it allowed me to say "if two of us think that then it must be true". I didn't include that in the final patch because it felt wrong. I didn't have a rational explanation for that then, just a bad feeling. So, after lots of sleep, here's my rational explanation of why we do *not* need that during hot standby queries:

VACUUM with a btree index proceeds like this:
1. Scan table
2. Remove rows from btree identified in (1)
3. Remove rows from heap identified in (1)

The purpose of the additional locking requirements during (2) for btrees is to ensure that we do not fail to find the rows identified in (1), because the rows can move after (1) and during (2) because of block splits. Requoting verbatim from the README:

"The tricky part of this is to avoid missing any deletable tuples in the presence of concurrent page splits: a page split could easily move some tuples from a page not yet passed over by the sequential scan to a lower-numbered page already passed over."

In recovery there are no concurrent page splits and the WAL records represent already successfully identified deletable tuples. On a standby server the rows will not move other than via WAL records. So there is no possibility that a WAL record will fail to find the row it was looking for. On the master we were looking for a tuple that pointed to a htid, whereas in WAL replay we look directly at the index tuple via its tid, not via the htid it points to.

Therefore we do not need the additional locking. That seems logical to me, so I will leave that out.

Any alternative views?

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> VACUUM with a btree index proceeds like this:
> 1. Scan table
> 2. Remove rows from btree identified in (1)
> 3. Remove rows from heap identified in (1)

> The purpose of the additional locking requirements during (2) for btrees
> is to ensure that we do not fail to find the rows identified in (1),
> because the rows can move after (1) and during (2) because of block
> splits.

No, you are missing the point. One purpose of the additional locking requirements is to ensure that there is not a concurrent process that has read a btree index entry just before you removed it but arrives at the heap page only after you removed the heap entry (and, perhaps, replaced it with some other row that doesn't match the index entry at all). This is clearly still a risk in a hot-standby environment.

			regards, tom lane
On Mon, 2008-11-03 at 10:07 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > VACUUM with a btree index proceeds like this:
> > 1. Scan table
> > 2. Remove rows from btree identified in (1)
> > 3. Remove rows from heap identified in (1)
> >
> > The purpose of the additional locking requirements during (2) for btrees
> > is to ensure that we do not fail to find the rows identified in (1),
> > because the rows can move after (1) and during (2) because of block
> > splits.
>
> No, you are missing the point. One purpose of the additional locking
> requirements is to ensure that there is not a concurrent process that
> has read a btree index entry just before you removed it but arrives at
> the heap page only after you removed the heap entry (and, perhaps,
> replaced it with some other row that doesn't match the index entry at
> all). This is clearly still a risk in a hot-standby environment.

OK, I think I get it now. Thanks for putting me straight.

So I will implement the locking-every-page approach discussed upthread: I will keep note of the blocks touched, in exactly that order, and store the info onto the WAL records accordingly.

Are you happy with my optimisation that if a page needs to be read in, we just skip it and pretend we did read-pin-unpin on it? I would implement that as a new ReadBuffer mode (in Heikki's new API terminology).

If you know of, or can see, any other missing correctness requirements, please let me know. I've not had trouble understanding any of the other correctness requirements, but I'll leave it to review to judge whether I've implemented them all correctly.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
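A sketch of the redo-side use of that optimisation, assuming an XLogReadBufferExtended(rnode, forknum, blkno, mode) form of the new ReadBuffer API and a hypothetical RBM_CACHE_ONLY mode that returns InvalidBuffer instead of reading the page from disk when it is not already in shared buffers; both of those assumptions are illustrative, not existing code:

    #include "postgres.h"
    #include "access/xlogutils.h"
    #include "storage/bufmgr.h"
    #include "storage/relfilenode.h"

    /*
     * Sketch only: take (and immediately release) a cleanup lock on a block
     * the vacuum scan passed over, but only if it is already cached.  A page
     * that is not in shared buffers cannot be pinned by anyone, so skipping
     * it is equivalent to read-pin-unpin.
     */
    static void
    CleanupLockBlockIfCached(RelFileNode rnode, BlockNumber blkno)
    {
        Buffer      buf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
                                                 RBM_CACHE_ONLY /* hypothetical */);

        if (!BufferIsValid(buf))
            return;                 /* not cached => not pinned; nothing to do */

        LockBufferForCleanup(buf);  /* waits until we hold the only pin */
        UnlockReleaseBuffer(buf);
    }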
Simon Riggs <simon@2ndQuadrant.com> writes:
> Are you happy with my optimisation that if a page needs to be read in,
> we just skip it and pretend we did read-pin-unpin on it?

If it's not in buffers then it cannot be pinned, so I concur that that is safe.

			regards, tom lane