Thread: VACUUM in SP-GiST
I started reading the spgist vacuum code last night, and didn't like it at all. I found a number of smaller issues, but it seems to me that the design is just fundamentally wrong. Unless I'm misunderstanding it, the approach is to recursively traverse the tree in much the same way that a full-index search would, except that it continues to hold lock on the immediate parent page of whichever page is currently being visited, so that it can update the downlink if it has to delete the first leaf tuple of a chain. I've got several complaints about that:

1. Holding exclusive locks on upper pages while we visit possibly hundreds of leaf pages seems pretty bad from a concurrency standpoint. It doesn't hold locks all the way up to the root, so it's not a complete loss, but even a first-level inner page can have a lot of children.

2. I do not trust this code to visit every leaf page (which, if it failed to do so and thereby missed deleting a heap reference, would result in a silently corrupt index). Unlike a regular index search, which makes a list of lower-level targets while examining an inner tuple just once, this code depends on the assumption that it can reacquire lock on an upper page and then resynchronize itself with whatever has been done meanwhile to the inner tuple it was looking at. I don't trust that. The mechanism being used relies on the page LSN having changed if anything on the page changed, which does not work for unlogged tables. Furthermore, if the inner tuple did change, it has to rescan *everything* that the inner tuple leads to. That's horrendously inefficient, and it's not even clear that it would ever terminate in the face of a constant stream of concurrent updates.

3. Even if it all worked, it would be slow as can be, because it requires a whole lot of nonsequential accesses to the index. For instance, the same leaf page might be visited many times (once for each leaf chain on the page), and not necessarily close together either.

Also, the code doesn't seem to perform any cleanup at all on inner pages. I am not expecting it to try to get rid of childless inner tuples, but surely it ought to at least convert old redirects to dead and get rid of dead tuples at the end of the page, much as for leaf pages?

What I'm envisioning doing instead is making a single sequential scan over the entire index. On both leaf and inner pages we could clean redirects and remove dead tuples at the end of the page. On leaf pages we'd also check for tuples to be deleted. There's no real difficulty in removing deleted tuples as long as they are not at the heads of their chains. (The code would have to reverse-engineer which tuples are at the heads of their chains, but that's easily done while scanning the page; we just make a reverse-lookup map for the nextOffset pointers.) If we have to delete a tuple at the head of its chain, but there's still at least one live tuple in its chain, we could reshuffle the tuples to bring another live tuple to that same offset number, thereby not invalidating the parent downlink. The only hard part is what to do when we delete all members of a chain: we can't reset the parent downlink because we do not know where it is. What I propose to do in that case is install a variant form of dead tuple that just indicates "this chain is empty". One way to represent that would be a redirect tuple pointing nowhere, but I think it would be cleaner to repurpose the isDead and isRedirect bits as a two-bit field with four states. We'd have LIVE, DEAD, REDIRECT, and a fourth state for a dead tuple that is not recyclable because we know there's a link to it somewhere. A scan would ignore such a tuple. An insertion that arrives at such a tuple could either replace it (if there's enough room on the page) or convert it to a redirect (if not).

Last night I was thinking the fourth state could be named TOMBSTONE, but maybe it'd be better to use DEAD for this case and rename the existing "removable dead tuple" state to PLACEHOLDER, since that state only exists to preserve the offset numbers of other tuples on the page.

Comments?

regards, tom lane
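[ For concreteness, a minimal sketch of the "reverse-lookup map for the nextOffset pointers" idea above, using the proposed two-bit tuple state. The struct layout, field names, and constants are illustrative assumptions, not the actual spgist_private.h definitions. ]

/*
 * Sketch only: determine which leaf tuples on an SP-GiST leaf page are
 * at the heads of their chains, by noting every offset that some other
 * tuple's nextOffset points to.  isChainHead[] is indexed by offset
 * number (1-based), so the caller must size it to max offset + 1.
 */
#include "postgres.h"
#include "storage/bufpage.h"

/* hypothetical two-bit tuple states, per the proposal above */
#define SKETCH_LIVE        0
#define SKETCH_REDIRECT    1
#define SKETCH_DEAD        2
#define SKETCH_PLACEHOLDER 3

typedef struct SketchLeafTuple
{
    unsigned int tupstate:2;    /* LIVE, REDIRECT, DEAD, PLACEHOLDER */
    unsigned int size:30;       /* total size of the tuple */
    OffsetNumber nextOffset;    /* next tuple in chain, or InvalidOffsetNumber */
    /* ... datum and heap pointer would follow ... */
} SketchLeafTuple;

static void
sketch_find_chain_heads(Page page, bool *isChainHead)
{
    OffsetNumber i,
                max = PageGetMaxOffsetNumber(page);

    /* assume every tuple is a chain head to start with ... */
    for (i = FirstOffsetNumber; i <= max; i++)
        isChainHead[i] = true;

    /* ... then clear the flag for anything that some nextOffset points to */
    for (i = FirstOffsetNumber; i <= max; i++)
    {
        SketchLeafTuple *lt = (SketchLeafTuple *)
            PageGetItem(page, PageGetItemId(page, i));

        if (lt->nextOffset != InvalidOffsetNumber)
            isChainHead[lt->nextOffset] = false;
    }
}

[ With a map like this in hand, a tuple whose flag is still set is a chain head, so deleting it requires either reshuffling another live chain member into its offset or installing the proposed "empty chain" placeholder tuple. ]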
hi,

> [ tgl's proposed redesign of the spgist vacuum algorithm ]

with a concurrent split moving leaf tuples between pages, can't it make bulkdelete fail reporting some TIDs and make CREATE INDEX CONCURRENTLY create duplicates?

YAMAMOTO Takashi
yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
>> [ tgl's redesign of spgist vacuum algorithm ]
> with a concurrent split moving leaf tuples between pages, can't it make
> bulkdelete fail reporting some TIDs

Yeah, you're right :-(. I've been chewing on this problem for a while and I think I see a solution. We only need to worry about cases where leaf tuples have been moved from one page to a lower-numbered page, not about rearrangements of inner pages. If leaf tuples were moved off a leaf page, the mover should have left a REDIRECT tuple behind. So we can fix things by having VACUUM act like a search scan when it comes across a REDIRECT that was made since the VACUUM scan started. In detail:

1. When we are scanning a leaf page and find a REDIRECT whose TID shows it was made since VACUUM started, add that TID to a pending list of tuple chains we need to recheck. (We can't go visit the TID immediately, because we don't want VACUUM holding lock on more than one buffer at a time. Instead, we'll process the pending list after finishing with the current page.)

2. Between pages of the normal sequential scan of the index, process pending-list entries until they're all gone.

3. To process a pending-list entry, visit and lock the indicated index page. If it's a leaf page, run the normal VACUUM processing on that page, then mark the pending-list entry done. (Note: we could just process the single tuple chain pointed to by the TID, but it seems better to process the whole page, in hopes that we'll only have to change it once and generate one WAL record, rather than possibly processing several chains one at a time. However, REDIRECTs should not be added to the pending list unless they are right at the TID we intended to visit; a redirect in another chain has been, or will be, processed by the regular scan.) If it's an inner page, examine the inner tuple pointed to by the pending-list TID, and add all the downlink TIDs in that inner tuple to the pending list.

The possibility that a leaf REDIRECT leads to an inner tuple (as a consequence of a PickSplit operation) means that we can't apply some apparently obvious optimizations; for example, if a REDIRECT TID points to a page beyond the current sequential scan point, we cannot simply ignore it, because it might be pointing at an inner rather than a leaf page. (We could possibly drop such a page without processing it once we'd locked it and verified it's a leaf page, but I tend to think we might as well process it if we've gotten that far.)

I complained that the original spgist vacuum algorithm might not terminate in the face of continuous concurrent updates, and this one seems to have some risk of similar misbehavior. However, we can fix that with suitable management of the pending list. Specifically, I have in mind that we'll not remove pending-list entries, only mark them "done", until the entire list is "done" and we're ready to resume the main sequential scan; and that we will check whether new TIDs are already in the pending list and not make duplicate entries. This should be enough to prevent any endless loops.

This all sounds pretty messy, but in practice the pending list should almost always be short, so there won't be a lot of performance lost.

Comments?

regards, tom lane
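[ Along the same lines, a rough sketch of pending-list bookkeeping that follows the "mark done, never remove, never duplicate" rules described above. The type and function names are made up for illustration and are not the actual spgvacuum code. ]

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct SketchPendingItem
{
    ItemPointerData tid;               /* redirection target to revisit */
    bool            done;              /* already processed? */
    struct SketchPendingItem *next;
} SketchPendingItem;

/* queue a TID for later processing, unless it is already in the list */
static void
sketch_pending_add(SketchPendingItem **listHead, ItemPointer tid)
{
    SketchPendingItem *item;

    for (item = *listHead; item != NULL; item = item->next)
    {
        if (ItemPointerEquals(&item->tid, tid))
            return;             /* no duplicate entries, so no endless loops */
    }

    item = (SketchPendingItem *) palloc(sizeof(SketchPendingItem));
    item->tid = *tid;
    item->done = false;
    item->next = *listHead;
    *listHead = item;
}

/* return the next entry still needing work, or NULL when all are done */
static SketchPendingItem *
sketch_pending_next(SketchPendingItem *listHead)
{
    SketchPendingItem *item;

    for (item = listHead; item != NULL; item = item->next)
    {
        if (!item->done)
            return item;
    }
    return NULL;        /* everything done: free the list, resume the main scan */
}

[ The main sequential scan would call sketch_pending_add() whenever it finds a sufficiently new REDIRECT, and between pages it would loop on sketch_pending_next(), processing each returned entry and setting its done flag, until NULL comes back. ]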