Re: [HACKERS] ginInsertCleanup called from vacuum could still misstuples to be deleted - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] ginInsertCleanup called from vacuum could still misstuples to be deleted
Date
Msg-id CAH2-WzkJzmyXgmxrNk7FXJyy7AHQs=hsOQVU3UP0ks6S25J7Jw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] ginInsertCleanup called from vacuum could still misstuples to be deleted  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Thu, Nov 16, 2017 at 2:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> e2c79e14 was to fix a pre-existing bug, but probably e9568083 made that bug
> easier to hit than it was before.  (Which is not to say that  e9568083 can't
> contain bugs of its own, of course)

I get that. I think that the same thing may have happened again, in
fact. A pending list recycling interlock bug may have merely been
unmasked by your commit e9568083; I'm speculating that your commit
made it more likely to hit in practice, just as with the earlier bug
you mentioned.

As you know, there is a significant history of VACUUM page recycling
bugs in GIN. I wouldn't be surprised if we found one more in the
pending list logic. Consider commit ac4ab97e, a bugfix from late 2013
for posting list page deletion, where the current posting list locking
protocol was more or less established. The commit message says:

"""
...
If a page is deleted, and reused for something else, just as a search is
following a rightlink to it from its left sibling, the search would continue
scanning whatever the new contents of the page are. That could lead to
incorrect query results, or even something more curious if the page is
reused for a different kind of a page.

To fix, modify the search algorithm to lock the next page before releasing
the previous one, and refrain from deleting pages from the leftmost branch
of the [posting] tree.
...
"""

I believe that this commit had GIN avoid deletion of the leftmost
branch of posting trees because that makes it safe to get to the
posting tree root from a raw root block number (a raw block number
from the B-tree index constructed over key values). The buffer lock
pairing/crabbing this commit adds to posting lists (similar to the
pairing/crabbing that pending lists had from day one, when first added
in 2011) can prevent a concurrent deletion once you reach the posting
tree root. But, as I said, you still need to reliably get to the root
in the first place, which the "don't delete leftmost" rule ensures (if
you can never delete it, you can never recycle it, and no reader gets
confused).

It's not clear why it's safe to recycle the pending list pages at all
(though deletion alone might well be okay, because readers can notice
a page is deleted if it isn't yet recycled). Why is it okay that we
can jump to the first block from the meta page in the face of
concurrent recycling?

Looking at scanPendingInsert(), it does appear that readers do buffer
lock crabbing/coupling for the meta page and the first pending page.
So that's an encouraging sign. However, ginInsertCleanup() *also* uses
shared buffer locks for both meta page and list head page (never
exclusive buffer locks) in exactly the same way when first
establishing what block is at the head of the list to be zapped. It's
acting as if it is a reader, but it's actually a writer. I don't
follow why this is correct. It looks like scanPendingInsert() can
decide that it knows where the head of the pending list is *after*
ginInsertCleanup() has already decided that that same head of the list
block needs to possibly be deleted/recycled. Have I missed something?

We really ought to make the asserts on gin page type into "can't
happen" elog errors, as a defensive measure, in both
scanPendingInsert() and ginInsertCleanup(). The 2013 bug fix actually
did this for the posting tree, but didn't touch similar pending list
code.

-- 
Peter Geoghegan


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] Bug in to_timestamp().
Next
From: Michael Paquier
Date:
Subject: Re: [JDBC] [HACKERS] Channel binding support for SCRAM-SHA-256