VACUUM in SP-GiST - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | VACUUM in SP-GiST |
Date | |
Msg-id | 10107.1323882826@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: VACUUM in SP-GiST
|
List | pgsql-hackers |
I started reading the spgist vacuum code last night, and didn't like it at all. I found a number of smaller issues, but it seems to me that the design is just fundamentally wrong. Unless I'm misunderstanding it, the approach is to recursively traverse the tree in sort of the same way that a full-index search would do, except that it continues to hold lock on the immediate parent page of whichever page is currently being visited, so that it can update the downlink if it has to delete the first leaf tuple of a chain. I've got several complaints about that: 1. Holding exclusive locks on upper pages while we visit possibly-hundreds of leaf pages seems pretty bad from a concurrency standpoint. It doesn't hold locks all the way up to the root, so it's not a complete loss, but even a first-level inner page can have a lot of children. 2. I do not trust this code to visit every leaf page (which, if it failed to do so and thereby missed deleting a heap reference, would result in a silently corrupt index). Unlike a regular index search, which makes a list of lower-level targets while examining an inner tuple just once, this code depends on the assumption that it can reacquire lock on an upper page and then resychronize itself with whatever's been done meanwhile to the inner tuple it was looking at. I don't trust that. The mechanism that's being used relies on page LSN having been changed if anything was changed on the page, which does not work in unlogged tables. Furthermore, if the inner tuple did change, it has to rescan *everything* that the inner tuple leads to. That's horrendously inefficient, and it's not even clear that it would ever terminate in the face of a constant stream of concurrent updates. 3. Even if it all worked, it would be slow as can be, because it requires a whole lot of nonsequential accesses to the index. For instance, the same leaf page might be visited many times (once for each leaf chain on the page), and not necessarily close together either. Also, the code doesn't seem to perform any cleanup at all on inner pages. I am not expecting it to try to get rid of childless inner tuples, but surely it ought to at least convert old redirects to dead and get rid of dead tuples at the end of the page, much as for leaf pages? What I'm envisioning doing instead is making a single sequential scan over the entire index. On both leaf and inner pages we could clean redirects and remove end dead tuples. On leaf pages we'd also check for tuples to be deleted. There's no real difficulty in removing deleted tuples as long as they are not at the heads of their chains. (The code would have to reverse-engineer which tuples are at the heads of their chains, but that's easily done while scanning the page; we just make a reverse-lookup map for the nextOffset pointers.) If we have to delete a tuple at the head of its chain, but there's still at least one live tuple in its chain, we could reshuffle the tuples to bring another live tuple to that same offset number, thereby not invalidating the parent downlink. The only hard part is what to do when we delete all members of a chain: we can't reset the parent downlink because we do not know where it is. What I propose to do in that case is install a variant form of dead tuple that just indicates "this chain is empty". One way to represent that would be a redirect tuple pointing nowhere, but I think it would be cleaner to repurpose the isDead and isRedirect bits as a two-bit field with four states. We'd have LIVE, DEAD, REDIRECT, and this fourth state for a dead tuple that is not recyclable because we know there's a link to it somewhere. A scan would ignore such a tuple. An insertion that arrives at such a tuple could either replace it (if there's enough room on the page) or convert it to a redirect (if not). Last night I was thinking the fourth state could be named TOMBSTONE, but maybe it'd be better to use DEAD for this case and rename the existing "removable dead tuple" state to PLACEHOLDER, since that case only exists to preserve the offset numbers of other tuples on the page. Comments? regards, tom lane
pgsql-hackers by date: