Re: ReadRecentBuffer() doesn't scale well - Mailing list pgsql-hackers

From Andres Freund
Subject Re: ReadRecentBuffer() doesn't scale well
Date
Msg-id 20230627174219.b2r7s4p7o42vsvoq@awork3.anarazel.de
Whole thread Raw
In response to Re: ReadRecentBuffer() doesn't scale well  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: ReadRecentBuffer() doesn't scale well
List pgsql-hackers
Hi,

On 2023-06-27 01:10:25 -0700, Peter Geoghegan wrote:
> On Mon, Jun 26, 2023 at 11:27 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:
> > > It should be safe to allow searchers to see a version of the root page
> > > that is out of date. The Lehman & Yao design is very permissive about
> > > these things. There aren't any special cases where the general rules
> > > are weakened in some way that might complicate this approach.
> > > Searchers need to check the high key to determine if they need to move
> > > right -- same as always.
> >
> > Wouldn't we at least need a pin on the root page, or hold a snapshot, to
> > defend against page deletions?
>
> You need to hold a snapshot to prevent concurrent page recycling --
> though not page deletion itself (I did say "anything that you'd
> usually think of as an interlock").

I don't think we'd want to have a snapshot for this, that make it much less
beneficial.


> I'm pretty sure that a concurrent page deletion is possible, even when you
> hold a pin on the page.  (Perhaps not, but if not then it's just an accident
> -- a side-effect of the interlock that protects against concurrent heap TID
> recycling.)

Looks like the pin should prevent the danger, but wouldn't be very attractive,
due to blocking vacuum...


I've wondered before about a type of pin that just prevents buffer
replacement, but not cleaning up page contents. I think that'd be beneficial
in quite a few places.


> You can't delete a rightmost page (on any level). Every root page is a
> rightmost page. So the root would have to be split, and then once
> again emptied before it could be deleted -- only then would there be a
> danger of some backend with a locally cached root page having an
> irredeemably bad picture of what's going on with the index. That's
> another angle that you could approach the problem from, I suppose.

If we had a way of just preventing the page from being replaced, or reliably
detecting that happening, without blocking btree vacuum, the easiest path
seems to be to use the cached version of the root page, and re-do the work
whenever a) the LSN of the page has changed or b) the buffer has been
replaced. To me that seems like it'd likely be simpler and more general than
relying on being able to step right from any outdated, but not deleted,
version of the page (due to the page deletion issues).

Obviously that'd lead to retries more often - but realistically it's still
going to be vanishingly rare, root pages don't get modified that much once the
index is beyond toy size.


I think a replacement counter in the buffer desc is the easiest way to achieve
that?  We'd have to store the buffer ID, buffer replacement counter and page
LSN in RelationData.  I think the protocol would have to be something like

1) do search on the copy of the root page

2) get page LSN from the relevant buffer contents - this could be from a
   different relation / block or even an empty page, but will never be an
   invalid memory access, as we don't free shared buffers before shutdown.  If
   the LSN changed since taking the copy, take a new copy of the root page and
   start at 1)

3) check if buffer replacement counter is the same as at the time of the copy,
   if not take a new copy of the root page and start at 1)

4) happiness


For optimization reasons it might make sense to store the buffer replacement
counter on a separate cacheline from BufferDesc.{state,content_lock}, so
reading the buffer replacement counter doesn't cause cacheline contention with
backends working with state/lock. But that's an implementation detail, and it
might not matter much, because the pressure on state,content_lock would be
reduced drastically.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: "Jonathan S. Katz"
Date:
Subject: Re: PostgreSQL 16 Beta 2 release announcement draft
Next
From: Andres Freund
Date:
Subject: Re: Add GUC to tune glibc's malloc implementation.