Re: Faster inserts with mostly-monotonically increasing values - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Faster inserts with mostly-monotonically increasing values
Date
Msg-id CAH2-Wz=5eZr0+h8wZozpHEJkoYS-T05qW_Z00JWRh3md3ofGCA@mail.gmail.com
Whole thread Raw
In response to Re: Faster inserts with mostly-monotonically increasing values  (Claudio Freire <klaussfreire@gmail.com>)
Responses Re: Faster inserts with mostly-monotonically increasing values
Re: Faster inserts with mostly-monotonically increasing values
List pgsql-hackers
On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> From what I read, both phase 1 & 2 are served by the !P_IGNORE check.

True.

> For the third phase:
>
>> A deleted page can only be reclaimed once there is no scan or search that
>> has a reference to it; until then, it must stay in place with its
>> right-link undisturbed.
>
> That's because:
>
>> A deleted page cannot be reclaimed immediately, since there may be other
>> processes waiting to reference it (ie, search processes that just left the
>> parent, or scans moving right or left from one of the siblings).  These
>> processes must observe that the page is marked dead and recover
>> accordingly.  Searches and forward scans simply follow the right-link
>> until they find a non-dead page --- this will be where the deleted page's
>> key-space moved to.
>
> But in thise case there's no right link to follow, so it's a non-issue.
>
> BTree doesn't truncate AFAIK, so the cached block number can't be
> nonexisting (beyond the end of the relation). It's probably fragile to
> assume BTree will never truncate, so maybe it would be wise to check
> for that. But if the block exists, I believe it contains a page that
> can be safely interpreted by that condition. As long as there can be
> no other pages with the same flags on the index while the cached page
> is write-locked, that code will be safe.

Introducing any case that allows us to land on a recycled page, and
reason that it must at least not be the page we were looking for based
on *any* criteria about the page itself seems very brittle. Yeah, it
probably won't break in practice, but it's a bad design.

> Yes, if the scankey is equal to the leftmost key, the first occurrence
> of that key could be on the left, so the check would have to be for
> strictly greater. But that's about as complex as it would get.

That's pretty complicated.

>> I didn't say that the patch is necessarily wrong here. Just that it
>> makes me nervous.
>
> Any change to btree would make anyone nervous ;-)

True. Anyway, this isn't the thing that I'm most concerned about right now.

>> I didn't get what the point of checking the first item on the page
>> (your proposal) was.
>
> Right now, if you've got concurrent transactions, there's absolutely
> no chance this optimization would kick in. For it to happen,
> insertions on the index need to happen in order, each insertion a key
> greater than any other key (visible or not) on the index.
>
> Concurrent inserts on PKs are unlikely to arrive in order, a slight
> disorder is to be expected as serial sequence numbers get generated a
> significant amount of time before the insert actually takes place.
> Different backends can get to the index insertion at different times,
> causing unordered inserts.
>
> I believe PKs are a prime candidate for this optimization, and
> expecting it to apply only when no concurrency is involved is severely
> dumbing down the optimization.

Pavan justified the patch using a benchmark that only involved a
single client -- hardly typical for a patch that changes the B-Tree
code. If the benefits with many clients can be shown to matter, that
will make this much more interesting to me.

-- 
Peter Geoghegan


pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: BUG #14941: Vacuum crashes
Next
From: Michael Paquier
Date:
Subject: Re: Kerberos test suite