Thread: Unsplitting btree index leaf pages
When we discussed online REINDEX recently we focused on the REINDEX
command itself rather than looking at alternative approaches. One reason
to REINDEX is that index page splits get things out of sequence and
generally bloat the index.

When we VACUUM, each index is scanned in logical order. While we scan,
if we find two adjacent pages, both of which have less than (say) 40% of
their rows, we could re-join or "unsplit" those pages together. The
index blocks are fully locked during the read anyway and there is no
MVCC problem with moving index rows between blocks. All we have to do is
lock both blocks, taking them in the correct order. The rows would
always be moved to the lowest physical block id, so that data would
naturally migrate towards the start of the index file. Blocks would then
be marked half-dead, just as if their last index row had been removed by
the VACUUM.

We could start the scan by locking block 1 AND block 2, then scan
forward always holding two locks as we go. That way we would never need
to unlock and relock blocks in order to take the second lock. The
concurrency loss would not be that great, but we would gain the ability
to unsplit the two blocks into one. If we do this, we could possibly
avoid the need for a full REINDEX entirely.

If this method checks out we could do one of:
- make VACUUM do this always
- add an option for REINDEX (CLEAN/COMPRESS/VACUUM etc) to do this upon
  command only

Best Regards,
Simon Riggs
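To make the proposed locking discipline concrete, here is a minimal
self-contained sketch of such an unsplit pass. It models pages with toy
structs and stub primitives rather than real nbtree buffers; every name
in it (leaf_page, lock_page, move_items, and so on) is hypothetical, not
a PostgreSQL API:

    #include <stdbool.h>
    #include <stddef.h>

    /* Toy model of a btree leaf page; real code would use Buffer/Page. */
    typedef struct leaf_page
    {
        int     blkno;              /* physical block number */
        int     nitems;             /* live index rows on the page */
        int     maxitems;           /* page capacity */
        struct leaf_page *next;     /* right sibling in logical order */
        bool    half_dead;          /* marked for later recycling */
    } leaf_page;

    #define UNSPLIT_THRESHOLD 0.40  /* the "(say) 40%" above */

    static bool
    sparse(const leaf_page *p)
    {
        return (double) p->nitems / p->maxitems < UNSPLIT_THRESHOLD;
    }

    /* Stubs standing in for exclusive buffer locks and tuple movement. */
    static void lock_page(leaf_page *p)   { (void) p; }
    static void unlock_page(leaf_page *p) { (void) p; }

    static void
    move_items(leaf_page *src, leaf_page *dst)
    {
        dst->nitems += src->nitems; /* real code would copy index tuples */
        src->nitems = 0;
    }

    /*
     * Walk the leaf level in logical order, always holding locks on two
     * adjacent pages, merging sparse neighbours into the lower-numbered
     * physical block.  Unlinking the emptied page from the sibling
     * chain is left out of the sketch.
     */
    static void
    unsplit_pass(leaf_page *first)
    {
        leaf_page *cur = first;

        lock_page(cur);
        while (cur->next != NULL)
        {
            leaf_page *next = cur->next;

            lock_page(next);        /* always left before right */
            if (sparse(cur) && sparse(next) &&
                cur->nitems + next->nitems <= cur->maxitems)
            {
                /* move rows to the lowest physical block id */
                leaf_page *dst = cur->blkno < next->blkno ? cur : next;
                leaf_page *src = dst == cur ? next : cur;

                move_items(src, dst);
                src->half_dead = true;  /* as if VACUUM emptied it */
            }
            unlock_page(cur);       /* keep two locks as we go */
            cur = next;
        }
        unlock_page(cur);
    }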
Simon Riggs <simon@2ndquadrant.com> writes:
> While we scan, if we find two adjacent pages, both of which have less
> than (say) 40% of their rows, we could re-join or "unsplit" those
> pages together.

Curiously enough, this has been thought of before. It is not as easy as
you think, or it would have been done the first time around.
nbtree/README explains why:

: We consider deleting an entire page from the btree only when it's become
: completely empty of items.  (Merging partly-full pages would allow better
: space reuse, but it seems impractical to move existing data items left or
: right to make this happen --- a scan moving in the opposite direction
: might miss the items if so.  We could do it during VACUUM FULL, though.)

			regards, tom lane
On Wed, 2005-12-21 at 19:07 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > While we scan, if we find two adjacent pages, both of which have less
> > than (say) 40% of their rows, we could re-join or "unsplit" those
> > pages together.
>
> Curiously enough, this has been thought of before. It is not as easy
> as you think, or it would have been done the first time around.
> nbtree/README explains why:
>
> : We consider deleting an entire page from the btree only when it's become
> : completely empty of items.  (Merging partly-full pages would allow better
> : space reuse, but it seems impractical to move existing data items left or
> : right to make this happen --- a scan moving in the opposite direction
> : might miss the items if so.  We could do it during VACUUM FULL, though.)

Sorry, I missed that.

"Seems impractical"? Complex, yes. But I think it sounds a lot simpler
than the online REINDEX scheme... which is also the reason enhancing
VACUUM FULL doesn't appeal.

During the first VACUUM index scan, we can identify pages that are
candidates for reorganisation. Then during the second physical scan that
follows we can reorg pages in the same manner as we delete them. We
identify a page during VACUUM for reorg if:
- it is < 20% full and we want to write this page
- the following page is < 20% full and we want to write this page
- it has a higher physical block number than the following page

That way we aren't super aggressive about reorg, but the cases we do
pick have a tendency to keep the index clustered. (Perhaps that could be
settable via an optional index parameter.) That definition also means we
have almost no additional I/O over what the VACUUM would have written
anyway. A sketch of this test appears below.

Page deletion requires us to lock left, lock right, lock above. That is
exactly the same if we have identified the page for reorg, since once we
have locked left and right, we just move rows to one or both of those
other blocks, then perform the half-dead marking.

I know you've considered these things deeply, but my viewpoint is that
when what you have is already very good, the only way forward consists
of considering seemingly foolish thoughts: backtracking.

Best Regards,
Simon Riggs
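Reading the three tests as a conjunction (which matches the "not super
aggressive" intent above), the candidate check is a simple predicate.
A sketch; the fill fractions and the dirty flags standing in for "we
want to write this page" are hypothetical names:

    #include <stdbool.h>

    /*
     * Decide during the first VACUUM index scan whether a leaf page
     * should be reorganised in the second scan.  fill/next_fill are
     * fractions of page capacity; dirty/next_dirty mean VACUUM is going
     * to write the page anyway.
     */
    static bool
    reorg_candidate(double fill, bool dirty,
                    double next_fill, bool next_dirty,
                    int blkno, int next_blkno)
    {
        return fill < 0.20 && dirty             /* sparse, written anyway */
            && next_fill < 0.20 && next_dirty   /* right sibling likewise */
            && blkno > next_blkno;              /* merging right-to-left
                                                 * keeps the index
                                                 * clustered */
    }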
Simon Riggs <simon@2ndquadrant.com> writes:
> Sorry, I missed that.

And you evidently still didn't understand it. Locking both pages does
not fix the problem, because it doesn't guarantee that there's not a
concurrent indexscan in flight from one to the other. If you move items
from one page to the other in the opposite direction from the way the
scan is going, then it will miss those items. If we try to fix this by
making scans lock one page before releasing the previous, then we'll
create a bunch of deadlock cases.

			regards, tom lane
On Thu, Dec 22, 2005 at 10:40:24AM -0500, Tom Lane wrote:
> And you evidently still didn't understand it. Locking both pages does
> not fix the problem, because it doesn't guarantee that there's not a
> concurrent indexscan in flight from one to the other. If you move items
> from one page to the other in the opposite direction from the way the
> scan is going, then it will miss those items. If we try to fix this by
> making scans lock one page before releasing the previous, then we'll
> create a bunch of deadlock cases.

I've been thinking, maybe there's a lockless way of going about this?
Have some kind of index modification identifier that you set at the
beginning of the index scan. What you do is mark the tuples you want to
move with an IMI (I'm trying to avoid the word transaction here) and
then copy the tuples to the new page with IMI+1. Any currently running
index scan will notice the higher IMI and ignore them. When all old
index scans are done, you can remove the old ones.

It's sort of a mini-MVCC for indexes, except you could probably just use
a few states: visible to all, visible to current scan, invisible to
current scan. Or use two bits to represent frozen, 1, 2 and 3. A plain
VACUUM could do the state transitions each time it moves through the
index.

The downsides are probably that it's a pain to make it work concurrently
and requires writing each index page at least twice. But it's a
thought...

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
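Martijn's three visibility states can be sketched as a two-bit tuple
marking plus the test a scan would apply. The encoding and names below
are invented for illustration, not a worked-out design:

    #include <stdbool.h>

    /* Per-tuple state, two bits: "visible to all, visible to current
     * scan, invisible to current scan" as described above. */
    typedef enum
    {
        IMI_VISIBLE_ALL,    /* ordinary tuple, equivalent to "frozen" */
        IMI_OLD_COPY,       /* being moved; scans already running still
                             * rely on it */
        IMI_NEW_COPY        /* the relocated copy, ignored until all old
                             * scans finish */
    } imi_state;

    /*
     * A scan remembers whether it started before the move (its IMI).
     * Old scans keep seeing the originals; new scans see only the
     * copies, so no scan ever sees a tuple twice or misses it.
     */
    static bool
    tuple_visible(imi_state s, bool scan_started_before_move)
    {
        switch (s)
        {
            case IMI_VISIBLE_ALL:
                return true;
            case IMI_OLD_COPY:
                return scan_started_before_move;
            case IMI_NEW_COPY:
                return !scan_started_before_move;
        }
        return false;       /* not reached */
    }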
Martijn van Oosterhout wrote:
> The downsides are probably that it's a pain to make it work
> concurrently and requires writing each index page at least twice. But
> it's a thought...

We already do something similar for page deletions. Empty pages are not
deleted right away, but they are marked with BTP_DEAD, and then deleted
on a subsequent vacuum. Or something like that, I don't remember the
exact details.

-- 
Alvaro Herrera                   http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
> We already do something similar for page deletions. Empty pages are not
> deleted right away, but they are marked with BTP_DEAD, and then deleted
> on a subsequent vacuum. Or something like that, I don't remember the
> exact details.

Right, and the reason for that is exactly that there might be a
concurrent indexscan already "in flight" to the newly-dead page. We must
wait to recycle the page until we are certain no such scans remain.

It doesn't matter whether a concurrent indexscan visits the dead page or
not, *because it's empty* and so there's nothing to miss. So there's no
race condition. But if you try to move valid data across pages then
there is a race condition.

			regards, tom lane
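The deferred recycling Tom describes amounts to a small lifecycle: a
page is emptied, marked, and only reused once no scan that could still
reach it remains. A sketch with invented names (the real code tracks
this with page flags and a transaction-id horizon):

    /* Lifecycle of a deleted btree page, per the discussion above. */
    typedef enum
    {
        PAGE_LIVE,          /* holds items, in the sibling chain */
        PAGE_DEAD,          /* emptied: a scan may still visit it, but
                             * finds nothing to miss */
        PAGE_RECYCLABLE     /* no in-flight scan can reach it any more */
    } page_state;

    /*
     * A dead page becomes recyclable only when every scan that might
     * already have been "in flight" toward it has finished;
     * oldest_running stands in for the real xid-horizon test.
     */
    static page_state
    maybe_recycle(page_state s,
                  unsigned long deleted_at, unsigned long oldest_running)
    {
        if (s == PAGE_DEAD && deleted_at < oldest_running)
            return PAGE_RECYCLABLE;
        return s;
    }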
On Thu, 2005-12-22 at 10:40 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Sorry, I missed that.
>
> And you evidently still didn't understand it. Locking both pages does
> not fix the problem, because it doesn't guarantee that there's not a
> concurrent indexscan in flight from one to the other. If you move items
> from one page to the other in the opposite direction from the way the
> scan is going, then it will miss those items. If we try to fix this by
> making scans lock one page before releasing the previous, then we'll
> create a bunch of deadlock cases.

Thank you for explaining things; I'm sorry you had to do it twice. I'm
ever optimistic, so I try to resist the temptation to stay quiet in case
I make a mistake.

Best Regards,
Simon Riggs
On Thu, 22 Dec 2005 10:40:24 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If you move items
> from one page to the other in the opposite direction from the way the
> scan is going, then it will miss those items.

AFAIU the (PG implementation of the) L&Y method is designed to make
scans immune against problems caused by items moving right within the
same page and against page splits, i.e. items moving to a *new* right
sibling. So making scans work with items moving to an *existing* right
sibling doesn't seem out of reach. The code following this comment in
_bt_restscan

	/*
	 * The item we're looking for moved right at least one page, so
	 * move right.  We are careful here to pin and read-lock the next
	 * non-dead page before releasing the current one.  This ensures
	 * that a concurrent btbulkdelete scan cannot pass our position
	 * --- if it did, it might be able to reach and delete our target
	 * item before we can find it again.
	 */

looks like it'd work for the case of page merging as well, as long as we
are careful to move items always from left to right.

BTW, if after having locked both pages we find that we have
super-exclusive locks, i.e. nobody else has pins on these pages, we can
reorganize much more aggressively. It might even be safe to move items
to the left page. The parent page might need some special handling,
though.

Servus
 Manfred
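The recovery loop that comment documents can be rendered in a
self-contained form. The types and primitives here are toy stand-ins,
not the real _bt_restscan code:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct page
    {
        struct page *right;         /* right sibling */
        bool         dead;
    } page;

    /* Hypothetical primitives for pins, locks, and item lookup. */
    extern bool item_on_page(const page *p, long item);
    extern void pin_and_lock(page *p);
    extern void release(page *p);

    /*
     * The item may have moved right any number of pages.  Pin and lock
     * the next page *before* releasing the current one, so a concurrent
     * bulk delete (or, for merging, an item mover) can never pass our
     * position.  This only works if items always move left to right.
     */
    static page *
    restscan(page *cur, long item)
    {
        while (!item_on_page(cur, item))
        {
            page *next = cur->right;

            if (next == NULL)
                return NULL;        /* item was deleted meanwhile */
            pin_and_lock(next);     /* take next before letting go */
            release(cur);
            cur = next;
            /* a real version would also skip dead pages here */
        }
        return cur;
    }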
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > We already do something similar for page deletions. Empty pages are not
> > deleted right away, but they are marked with BTP_DEAD, and then deleted
> > on a subsequent vacuum. Or something like that, I don't remember the
> > exact details.
>
> Right, and the reason for that is exactly that there might be a
> concurrent indexscan already "in flight" to the newly-dead page.
> We must wait to recycle the page until we are certain no such scans
> remain.
>
> It doesn't matter whether a concurrent indexscan visits the dead
> page or not, *because it's empty* and so there's nothing to miss.
> So there's no race condition. But if you try to move valid data
> across pages then there is a race condition.

Hmm... Well, REINDEX is apparently a very expensive operation right now.
But how expensive would it be to go through the entire index and perform
the index page merge operation being discussed here, and nothing else?

If it's fast enough, might it be worthwhile to implement just this alone
as a separate maintenance command (e.g., VACUUM INDEX) that acquires the
appropriate lock (AccessExclusive, I'd expect) on the index to prevent
exactly the issues you're concerned about?

If it's fast enough even on large tables, it would be a nice alternative
to REINDEX, I'd think.

-- 
Kevin Brown                                      kevin@sysexperts.com
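If such a command existed, its top level could be as simple as the
outline below: take the strongest lock so no scan can be in flight, then
run only the merge pass. Every name here is hypothetical:

    /* Hypothetical primitives for a standalone "VACUUM INDEX". */
    extern void lock_index_access_exclusive(const char *idxname);
    extern void unlock_index(const char *idxname);
    extern void merge_sparse_leaf_pages(const char *idxname);

    /*
     * With an AccessExclusive-style lock held, no concurrent indexscan
     * exists, so the in-flight-scan race discussed above cannot arise
     * and pages can be merged freely.
     */
    void
    vacuum_index(const char *idxname)
    {
        lock_index_access_exclusive(idxname);
        merge_sparse_leaf_pages(idxname);
        unlock_index(idxname);
    }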
Manfred Koizar <mkoi-pg@aon.at> writes:
> BTW, if after having locked both pages we find that we have
> super-exclusive locks, i.e. nobody else has pins on these pages, we
> can reorganize much more aggressively.

No, we cannot. I'm getting tired of explaining this.

			regards, tom lane
Kevin Brown <kevin@sysexperts.com> writes:
> Well, REINDEX is apparently a very expensive operation right now. But
> how expensive would it be to go through the entire index and perform
> the index page merge operation being discussed here, and nothing else?
>
> If it's fast enough, might it be worthwhile to implement just this
> alone as a separate maintenance command (e.g., VACUUM INDEX) that
> acquires the appropriate lock (AccessExclusive, I'd expect) on the
> index to prevent exactly the issues you're concerned about?
>
> If it's fast enough even on large tables, it would be a nice
> alternative to REINDEX, I'd think.

This would work, but it's hard to tell if it'd be worthwhile short of
actually doing an implementation and field-testing it ...

			regards, tom lane