Re: VACUUM in SP-GiST - Mailing list pgsql-hackers

From Tom Lane
Subject Re: VACUUM in SP-GiST
Date
Msg-id 26043.1331523168@sss.pgh.pa.us
Whole thread Raw
In response to Re: VACUUM in SP-GiST  (yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi))
List pgsql-hackers
yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
>> [ tgl's redesign of spgist vacuum algorithm ]

> with a concurrent split moving leaf tuples between pages, can't it make
> bulkdelete fail reporting some TIDs

Yeah, you're right :-(.  I've been chewing on this problem for awhile
and I think I see a solution.  We only need to worry about cases where
leaf tuples have been moved from one page to a lower-numbered page,
not about rearrangements of inner pages.  If leaf tuples were moved
off a leaf page, the mover should have left a REDIRECT tuple behind.
So we can fix things by having VACUUM act like a search scan when
it comes across a REDIRECT that was made since since the VACUUM scan
started.  In detail:

1. When we are scanning a leaf page and we find a REDIRECT whose TID
shows it was made since VACUUM started, add that TID to a pending-list
of tuple chains we need to recheck.  (We can't go visit the TID
immediately, because we don't want VACUUM holding lock on more than one
buffer at a time.  Instead, we'll process the pending-list after
finishing with the current page.)

2. Between pages of the normal sequential scan of the index, process
pending-list entries until they're all gone.

3. To process a pending-list entry, visit and lock the indicated index
page.  If it's a leaf page, run the normal VACUUM processing on that
page, then mark the pending-list entry done.  (Note: we could just
process the single tuple chain pointed to by the TID, but it seems
better to process the whole page, in hopes that we'll only have to
change it once and generate one WAL record, rather than possibly process
several chains one-at-a-time.  However, REDIRECTs should not be added to
the pending-list unless they are right at the TID we intended to visit.
A redirect in another chain has or will be processed by the regular
scan.)  If it's an inner page, examine the inner tuple pointed to by the
pending-list TID, and add all the downlink TIDs in that inner tuple to
the pending list.

The possibility that a leaf REDIRECT leads to an inner tuple (as a
consequence of a PickSplit operation) means that we can't apply some
apparently-obvious optimizations; for example, if a REDIRECT TID points
to a page beyond the current sequential scan point, we cannot simply
ignore it because it might be pointing at an inner rather than leaf
page.  (We could possibly drop such a page without processing it once we'd
locked it and verified it's a leaf page, but I tend to think we might as
well process it if we've gotten that far.)

I complained that the original spgist vacuum algorithm might not
terminate in the face of continuous concurrent updates, and this one
seems to have some risk of similar misbehavior.  However, we can fix
that with suitable management of the pending list.  Specifically, I have
in mind that we'll not remove pending-list entries, only mark them
"done", until the entire list is "done" and we're ready to resume the
main sequential scan; and that we will check for new TIDs being already
in the pending-list and not make duplicate entries.  This should be
enough to prevent any endless loops.

This all sounds pretty messy, but in practice the pending list should
almost always be short, so there won't be a lot of performance lost.

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Review of pg_archivecleanup -x option patch
Next
From: Fujii Masao
Date:
Subject: Re: wal_buffers, redux