Thread: VACUUM in SP-GiST
I started reading the spgist vacuum code last night, and didn't like it at all. I found a number of smaller issues, but it seems to me that the design is just fundamentally wrong. Unless I'm misunderstanding it, the approach is to recursively traverse the tree in much the same way that a full-index search would, except that it continues to hold lock on the immediate parent page of whichever page is currently being visited, so that it can update the downlink if it has to delete the first leaf tuple of a chain. I've got several complaints about that:

1. Holding exclusive locks on upper pages while we visit possibly hundreds of leaf pages seems pretty bad from a concurrency standpoint. It doesn't hold locks all the way up to the root, so it's not a complete loss, but even a first-level inner page can have a lot of children.

2. I do not trust this code to visit every leaf page (which, if it failed to do so and thereby missed deleting a heap reference, would result in a silently corrupt index). Unlike a regular index search, which makes a list of lower-level targets while examining an inner tuple just once, this code depends on the assumption that it can reacquire lock on an upper page and then resynchronize itself with whatever has been done meanwhile to the inner tuple it was looking at. I don't trust that. The mechanism being used relies on the page LSN having changed if anything on the page changed, which does not work for unlogged tables. Furthermore, if the inner tuple did change, it has to rescan *everything* that the inner tuple leads to. That's horrendously inefficient, and it's not even clear that it would ever terminate in the face of a constant stream of concurrent updates.

3. Even if it all worked, it would be slow as can be, because it requires a whole lot of nonsequential accesses to the index. For instance, the same leaf page might be visited many times (once for each leaf chain on the page), and not necessarily close together either.

Also, the code doesn't seem to perform any cleanup at all on inner pages. I am not expecting it to try to get rid of childless inner tuples, but surely it ought to at least convert old redirects to dead and get rid of dead tuples at the end of the page, much as for leaf pages?

What I'm envisioning doing instead is making a single sequential scan over the entire index. On both leaf and inner pages we could clean redirects and remove dead tuples at the end of the page. On leaf pages we'd also check for tuples to be deleted. There's no real difficulty in removing deleted tuples as long as they are not at the heads of their chains. (The code would have to reverse-engineer which tuples are at the heads of their chains, but that's easily done while scanning the page; we just make a reverse-lookup map for the nextOffset pointers.) If we have to delete a tuple at the head of its chain, but there's still at least one live tuple in its chain, we could reshuffle the tuples to bring another live tuple to that same offset number, thereby not invalidating the parent downlink. The only hard part is what to do when we delete all members of a chain: we can't reset the parent downlink because we do not know where it is. What I propose to do in that case is install a variant form of dead tuple that just indicates "this chain is empty". One way to represent that would be a redirect tuple pointing nowhere, but I think it would be cleaner to repurpose the isDead and isRedirect bits as a two-bit field with four states. We'd have LIVE, DEAD, REDIRECT, and a fourth state for a dead tuple that is not recyclable because we know there's a link to it somewhere. A scan would ignore such a tuple. An insertion that arrives at such a tuple could either replace it (if there's enough room on the page) or convert it to a redirect (if not).

Last night I was thinking the fourth state could be named TOMBSTONE, but maybe it'd be better to use DEAD for this case and rename the existing "removable dead tuple" state to PLACEHOLDER, since that state only exists to preserve the offset numbers of other tuples on the page.

Comments?

regards, tom lane
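[ For concreteness, a minimal sketch of the "reverse-lookup map for the nextOffset pointers" idea above, using the proposed two-bit tuple state. The struct layout, field names, and constants are illustrative assumptions, not the actual spgist_private.h definitions. ]

/*
 * Sketch only: determine which leaf tuples on an SP-GiST leaf page are
 * at the heads of their chains, by noting every offset that some other
 * tuple's nextOffset points to.  isChainHead[] is indexed by offset
 * number (1-based), so the caller must size it to max offset + 1.
 */
#include "postgres.h"
#include "storage/bufpage.h"

/* hypothetical two-bit tuple states, per the proposal above */
#define SKETCH_LIVE        0
#define SKETCH_REDIRECT    1
#define SKETCH_DEAD        2
#define SKETCH_PLACEHOLDER 3

typedef struct SketchLeafTuple
{
    unsigned int tupstate:2;    /* LIVE, REDIRECT, DEAD, PLACEHOLDER */
    unsigned int size:30;       /* total size of the tuple */
    OffsetNumber nextOffset;    /* next tuple in chain, or InvalidOffsetNumber */
    /* ... datum and heap pointer would follow ... */
} SketchLeafTuple;

static void
sketch_find_chain_heads(Page page, bool *isChainHead)
{
    OffsetNumber i,
                max = PageGetMaxOffsetNumber(page);

    /* assume every tuple is a chain head to start with ... */
    for (i = FirstOffsetNumber; i <= max; i++)
        isChainHead[i] = true;

    /* ... then clear the flag for anything that some nextOffset points to */
    for (i = FirstOffsetNumber; i <= max; i++)
    {
        SketchLeafTuple *lt = (SketchLeafTuple *)
            PageGetItem(page, PageGetItemId(page, i));

        if (lt->nextOffset != InvalidOffsetNumber)
            isChainHead[lt->nextOffset] = false;
    }
}

[ With a map like this in hand, a tuple whose flag is still set is a chain head, so deleting it requires either reshuffling another live chain member into its offset or installing the proposed "empty chain" placeholder tuple. ]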
hi,

> [ tgl's proposed redesign of the spgist vacuum algorithm ]

with a concurrent split moving leaf tuples between pages, can't it make bulkdelete fail reporting some TIDs and make CREATE INDEX CONCURRENTLY create duplicates?

YAMAMOTO Takashi
yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
>> [ tgl's redesign of spgist vacuum algorithm ]
> with a concurrent split moving leaf tuples between pages, can't it make
> bulkdelete fail reporting some TIDs

Yeah, you're right :-(. I've been chewing on this problem for a while and I think I see a solution. We only need to worry about cases where leaf tuples have been moved from one page to a lower-numbered page, not about rearrangements of inner pages. If leaf tuples were moved off a leaf page, the mover should have left a REDIRECT tuple behind. So we can fix things by having VACUUM act like a search scan when it comes across a REDIRECT that was made since the VACUUM scan started. In detail:

1. When we are scanning a leaf page and find a REDIRECT whose TID shows it was made since VACUUM started, add that TID to a pending list of tuple chains we need to recheck. (We can't go visit the TID immediately, because we don't want VACUUM holding lock on more than one buffer at a time. Instead, we'll process the pending list after finishing with the current page.)

2. Between pages of the normal sequential scan of the index, process pending-list entries until they're all gone.

3. To process a pending-list entry, visit and lock the indicated index page. If it's a leaf page, run the normal VACUUM processing on that page, then mark the pending-list entry done. (Note: we could just process the single tuple chain pointed to by the TID, but it seems better to process the whole page, in hopes that we'll only have to change it once and generate one WAL record, rather than possibly processing several chains one at a time. However, REDIRECTs should not be added to the pending list unless they are right at the TID we intended to visit; a redirect in another chain has been, or will be, processed by the regular scan.) If it's an inner page, examine the inner tuple pointed to by the pending-list TID, and add all the downlink TIDs in that inner tuple to the pending list.

The possibility that a leaf REDIRECT leads to an inner tuple (as a consequence of a PickSplit operation) means that we can't apply some apparently obvious optimizations; for example, if a REDIRECT TID points to a page beyond the current sequential scan point, we cannot simply ignore it, because it might be pointing at an inner rather than a leaf page. (We could possibly drop such a page without processing it once we'd locked it and verified it's a leaf page, but I tend to think we might as well process it if we've gotten that far.)

I complained that the original spgist vacuum algorithm might not terminate in the face of continuous concurrent updates, and this one seems to have some risk of similar misbehavior. However, we can fix that with suitable management of the pending list. Specifically, I have in mind that we'll not remove pending-list entries, only mark them "done", until the entire list is "done" and we're ready to resume the main sequential scan; and that we will check whether new TIDs are already in the pending list and not make duplicate entries. This should be enough to prevent any endless loops.

This all sounds pretty messy, but in practice the pending list should almost always be short, so there won't be a lot of performance lost.

Comments?

regards, tom lane
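[ Along the same lines, a rough sketch of pending-list bookkeeping that follows the "mark done, never remove, never duplicate" rules described above. The type and function names are made up for illustration and are not the actual spgvacuum code. ]

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct SketchPendingItem
{
    ItemPointerData tid;               /* redirection target to revisit */
    bool            done;              /* already processed? */
    struct SketchPendingItem *next;
} SketchPendingItem;

/* queue a TID for later processing, unless it is already in the list */
static void
sketch_pending_add(SketchPendingItem **listHead, ItemPointer tid)
{
    SketchPendingItem *item;

    for (item = *listHead; item != NULL; item = item->next)
    {
        if (ItemPointerEquals(&item->tid, tid))
            return;             /* no duplicate entries, so no endless loops */
    }

    item = (SketchPendingItem *) palloc(sizeof(SketchPendingItem));
    item->tid = *tid;
    item->done = false;
    item->next = *listHead;
    *listHead = item;
}

/* return the next entry still needing work, or NULL when all are done */
static SketchPendingItem *
sketch_pending_next(SketchPendingItem *listHead)
{
    SketchPendingItem *item;

    for (item = listHead; item != NULL; item = item->next)
    {
        if (!item->done)
            return item;
    }
    return NULL;        /* everything done: free the list, resume the main scan */
}

[ The main sequential scan would call sketch_pending_add() whenever it finds a sufficiently new REDIRECT, and between pages it would loop on sketch_pending_next(), processing each returned entry and setting its done flag, until NULL comes back. ]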