Re: snapshot too old, configured by time - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: snapshot too old, configured by time
Date
Msg-id CAM3SWZSiL52kGFPVVST=hA=oA+-YWEHQAPK_pVocReBoOEeQqQ@mail.gmail.com
Whole thread Raw
In response to Re: snapshot too old, configured by time  (Kevin Grittner <kgrittn@gmail.com>)
Responses Re: snapshot too old, configured by time
List pgsql-hackers
On Wed, Mar 30, 2016 at 11:53 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> When the initial "proof of concept" patch was tested by the
> customer, it was not effective due to issues related to what you
> raise.  Autovacuum workers were blocking due to the page pins for
> scans using these old snapshots, causing the bloat to accumulate in
> spite of this particular patch.  This was addressed, at least to a
> degree sufficient for this customer, with this patch:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ed5b87f96d473962ec5230fd820abfeaccb2069
>
> Basically, for most common cases the "super-exclusive" locking has
> been eliminated from btree logic; and I have been happy to see that
> Simon has been working on dealing with the corner cases where I
> hadn't rooted it out.
>
>> I worry that something weird could happen there. For example, perhaps
>> the page LSN on what is actually a newly recycled page could be set
>> such that the backend following a stale right spuriously raises a
>> "snapshot too old" error.
>
> That particular detail doesn't seem to be a realistic concern,
> though -- if a page has been deleted, assigned to a new place in
> the index with an LSN corresponding to that action, it would be a
> pretty big bug if a right-pointer still referenced it.

Yes, that would be a big bug. But I wasn't talking about
"super-exclusive" locking. Rather, I was talking about the general way
in which index scans are guaranteed to not land on an already-recycled
page (not a half-dead page, and not a fully deleted page -- a fully
reclaimed/recycled page). This works without buffer pins or buffer
locks needing to be held at all -- there is a global interlock against
page *recycling* based on RecentGlobalXmin, per the nbtree README. So,
this vague concern of mine is about VACUUM's B-Tree page recycling.

During an index scan, we expect to be able to land on the next page,
and to be at least able to reason about it being deleted, even though
we don't hold a pin on anything for a period. We certainly are shy
about explaining all this, but if you look at a routine like
_bt_search() carefully (the routine that is used to descend a B-Tree),
it doesn't actually hold a pin concurrently, as we drop a level (and
certainly not a buffer lock, either). The next page/block should be
substantively the same page as expected from the downlink (or right
link) we followed, entirely because of the RecentGlobalXmin interlock.
Backwards scans also rely on this.

This is just a vague concern, and possibly this is completely
irrelevant. I haven't read the patch.

>> I suggest you consider making amcheck [1] a part of your testing
>> strategy. I think that this patch is a good idea, and I'd be happy to
>> take feedback from you on how to make amcheck more effective for
>> testing this patch in particular.
>
> I'm not sure how that would fit in; could you elaborate?

Well, amcheck is a tool that in essence makes sure that B-Trees look
structurally sound, and respect invariants like having every item on
each page in logical order. That could catch a bug of the kind I just
described, because it's quite likely that the recycled page would
happen to have items that didn't comport with the ordering on the
page. The block has been reused essentially at random. Importantly,
amcheck can catch this without anything more than an AccessShareLock,
so you have some hope of catching this kind of race condition (the
stale downlink that you followed to get to the
spuriously-recycled-early page doesn't stay stale for long). Or, maybe
it would happen to catch some other random problem. Difficult to say.

Again, this is based on a speculation that might be wrong. But it's
worth considering.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: snapshot too old, configured by time
Next
From: Alvaro Herrera
Date:
Subject: Re: snapshot too old, configured by time