Re: VACUUM in SP-GiST - Mailing list pgsql-hackers

From yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
Subject Re: VACUUM in SP-GiST
Msg-id 20120116214335.D88C414A22B@mail.netbsd.org
In response to VACUUM in SP-GiST  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: VACUUM in SP-GiST  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
hi,

> I started reading the spgist vacuum code last night, and didn't like it
> at all.  I found a number of smaller issues, but it seems to me that
> the design is just fundamentally wrong.  Unless I'm misunderstanding it,
> the approach is to recursively traverse the tree in sort of the same way
> that a full-index search would do, except that it continues to hold lock
> on the immediate parent page of whichever page is currently being
> visited, so that it can update the downlink if it has to delete the
> first leaf tuple of a chain.  I've got several complaints about that:
> 
> 1. Holding exclusive locks on upper pages while we visit
> possibly-hundreds of leaf pages seems pretty bad from a concurrency
> standpoint.  It doesn't hold locks all the way up to the root, so it's
> not a complete loss, but even a first-level inner page can have a lot of
> children.
> 
> 2. I do not trust this code to visit every leaf page (which, if it
> failed to do so and thereby missed deleting a heap reference, would
> result in a silently corrupt index).  Unlike a regular index search,
> which makes a list of lower-level targets while examining an inner tuple
> just once, this code depends on the assumption that it can reacquire
> lock on an upper page and then resynchronize itself with whatever's been
> done meanwhile to the inner tuple it was looking at.  I don't trust
> that.  The mechanism that's being used relies on page LSN having been
> changed if anything was changed on the page, which does not work in
> unlogged tables.  Furthermore, if the inner tuple did change, it has to
> rescan *everything* that the inner tuple leads to.  That's horrendously
> inefficient, and it's not even clear that it would ever terminate in the
> face of a constant stream of concurrent updates.
> 
> 3. Even if it all worked, it would be slow as can be, because it
> requires a whole lot of nonsequential accesses to the index.  For
> instance, the same leaf page might be visited many times (once for
> each leaf chain on the page), and not necessarily close together either.
> 
> Also, the code doesn't seem to perform any cleanup at all on inner
> pages.  I am not expecting it to try to get rid of childless inner
> tuples, but surely it ought to at least convert old redirects to dead
> and get rid of dead tuples at the end of the page, much as for leaf
> pages?
> 
> What I'm envisioning doing instead is making a single sequential scan
> over the entire index.  On both leaf and inner pages we could clean
> redirects and remove dead tuples at the end of the page.  On leaf pages
> we'd also check for
> tuples to be deleted.  There's no real difficulty in removing deleted
> tuples as long as they are not at the heads of their chains.  (The code
> would have to reverse-engineer which tuples are at the heads of their
> chains, but that's easily done while scanning the page; we just make a
> reverse-lookup map for the nextOffset pointers.)  If we have to delete a
> tuple at the head of its chain, but there's still at least one live
> tuple in its chain, we could reshuffle the tuples to bring another live
> tuple to that same offset number, thereby not invalidating the parent
> downlink.  The only hard part is what to do when we delete all members
> of a chain: we can't reset the parent downlink because we do not know
> where it is.  What I propose to do in that case is install a variant
> form of dead tuple that just indicates "this chain is empty".  One way
> to represent that would be a redirect tuple pointing nowhere, but I
> think it would be cleaner to repurpose the isDead and isRedirect bits as
> a two-bit field with four states.  We'd have LIVE, DEAD, REDIRECT, and
> this fourth state for a dead tuple that is not recyclable because we
> know there's a link to it somewhere.  A scan would ignore such a tuple.
> An insertion that arrives at such a tuple could either replace it (if
> there's enough room on the page) or convert it to a redirect (if not).
> 
> Last night I was thinking the fourth state could be named TOMBSTONE,
> but maybe it'd be better to use DEAD for this case and rename the
> existing "removable dead tuple" state to PLACEHOLDER, since that case
> only exists to preserve the offset numbers of other tuples on the
> page.
> 
> Comments?

with a concurrent split moving leaf tuples between pages, can't that make
bulkdelete fail to report some TIDs, and thus make CREATE INDEX CONCURRENTLY
create duplicates?

YAMAMOTO Takashi

> 
>             regards, tom lane
> 

