Re: Faster inserts with mostly-monotonically increasing values - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: Faster inserts with mostly-monotonically increasing values
Date
Msg-id CAGTBQpahfVtS_UaZMXdx82FPRKhtB_LLUTvV5qoKL55OWcRmzA@mail.gmail.com
In response to Re: Faster inserts with mostly-monotonically increasing values  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Faster inserts with mostly-monotonically increasing values  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Mon, Mar 5, 2018 at 9:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Mar 5, 2018 at 4:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Correct me if I'm wrong, but there's lots of code in nbtree already
>> that risks reading recycled pages for various reasons. Presumably,
>> checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
>> !P_IGNORE implies.
>
> You're mistaken. Read the nbtree README on page deletion, and the
> RecentGlobalXmin interlock. Note also that there are 3 distinct page
> states in play as far as deletion/recycling is concerned:
>
> 1. The first phase of deletion
> 2. The second phase of deletion.
> 3. Post-recycling, when in general the page could look like just about anything.
>
> It's page deletion's job to make sure that an index scan cannot land
> on a page after it was recycled. The corollary is that it is not
> everyone else's job to make sure that that didn't happen -- how could
> that possibly work?

From what I read, both phases 1 and 2 are covered by the !P_IGNORE check.
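
For reference, those are the page-flag macros from
src/include/access/nbtree.h (paraphrased here; the exact form varies a
bit across versions):

    #define P_ISDELETED(opaque)  ((opaque)->btpo_flags & BTP_DELETED)
    #define P_ISHALFDEAD(opaque) ((opaque)->btpo_flags & BTP_HALF_DEAD)
    #define P_IGNORE(opaque)     ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))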

For the third phase:

> A deleted page can only be reclaimed once there is no scan or search that
> has a reference to it; until then, it must stay in place with its
> right-link undisturbed.

That's because:

> A deleted page cannot be reclaimed immediately, since there may be other
> processes waiting to reference it (ie, search processes that just left the
> parent, or scans moving right or left from one of the siblings).  These
> processes must observe that the page is marked dead and recover
> accordingly.  Searches and forward scans simply follow the right-link
> until they find a non-dead page --- this will be where the deleted page's
> key-space moved to.

But in this case there's no right link to follow, so it's a non-issue.

BTree doesn't truncate AFAIK, so the cached block number can't be
nonexistent (beyond the end of the relation). It's probably fragile to
assume BTree will never truncate, though, so it would be wise to check
for that. But if the block exists, I believe it contains a page that
can be safely interpreted by that condition. As long as no other page
in the index can carry the same flags while the cached page is
write-locked, that code will be safe.
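
To make that concrete, the shape of the guard I have in mind is
something like this (illustrative only; a name like "cachedBlock" is
mine, not the patch's, and real code would also need to deal with lock
acquisition order and with releasing the buffer on the slow path):

    /* Try the remembered rightmost block first, guarding against
     * the (hypothetical) case of the relation having shrunk. */
    if (cachedBlock != InvalidBlockNumber &&
        cachedBlock < RelationGetNumberOfBlocks(rel))
    {
        Buffer       buf = ReadBuffer(rel, cachedBlock);
        Page         page;
        BTPageOpaque opaque;

        LockBuffer(buf, BT_WRITE);
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        /* Only trust the cached block if it is still a live,
         * rightmost leaf page. */
        if (P_ISLEAF(opaque) && P_RIGHTMOST(opaque) && !P_IGNORE(opaque))
        {
            /* ... shortcut: insert directly on this page ... */
        }
        else
        {
            /* ... unlock and fall back to the normal descent ... */
        }
    }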

>>>> You could allow for some slack, by comparing against the leftmost key
>>>> instead. You just need to know whether the key fits in the page. Then,
>>>> the insert can proceed as usual, doing the binsearch to find the
>>>> proper insertion point, etc.
>>>
>>> The whole way that this patch opts out of using an exclusive buffer
>>> lock on the "first page the value could be on" (the _bt_check_unique()
>>> + _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
>>> terribly concrete.
>>
>> Assuming the rightmost page is the first page the value could be on,
>> it already does get an exclusive buffer lock.
>
> This can be quite fiddly. The downlink in the parent can be equal to
> the value in the target page, meaning that we actually lock the
> target's left sibling (see _bt_binsrch() special case for internal
> pages).

I'm not sure which special case you're referring to; I don't see
anything relevant in _bt_binsrch.

Yes, if the scankey is equal to the leftmost key, the first occurrence
of that key could be on a page to the left, so the check would have to
be for strictly greater. But that's about as complex as it would get.

> I didn't say that the patch is necessarily wrong here. Just that it
> makes me nervous.

Any change to btree would make anyone nervous ;-)

>> If one wanted to relax the condition as I proposed, that uniqueness
>> check is necessary, so that's why I propose short-circuiting the
>> search only, and not the actual insertion on the page. IMO it's
>> more important to try and maximize the conditions under which the
>> optimization can be applied.
>
> I didn't get what the point of checking the first item on the page
> (your proposal) was.

Right now, if you've got concurrent transactions, there's absolutely
no chance this optimization will kick in. For it to apply, insertions
on the index need to arrive in order, each new key greater than any
other key (visible or not) already in the index.
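
For context, that strict test amounts to comparing against the *last*
item on the rightmost leaf, roughly like this (illustrative, not the
patch's actual code; "itup_scankey" and "natts" assumed set up as in
_bt_doinsert):

    OffsetNumber last = PageGetMaxOffsetNumber(page);

    /* Only take the shortcut when the new key sorts after everything
     * already on the rightmost leaf. */
    if (P_RIGHTMOST(opaque) && !P_IGNORE(opaque) &&
        last >= P_FIRSTDATAKEY(opaque) &&
        _bt_compare(rel, natts, itup_scankey, page, last) > 0)
    {
        /* ... append at the end of the page ... */
    }

Any interleaving that makes one key arrive "late" defeats that test.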

Concurrent inserts on PKs are unlikely to arrive in order; some
disorder is to be expected, since serial sequence numbers are generated
a significant amount of time before the insert actually takes place.
Different backends can reach the index insertion at different times,
causing unordered inserts.

I believe PKs are a prime candidate for this optimization, and
expecting it to apply only when no concurrency is involved severely
dumbs down the optimization.

Consider 3 transactions on a table like:

CREATE TABLE test ( id serial, d integer );

T1: INSERT INTO test (d) VALUES (3), (4), (5), (6);
T2: INSERT INTO test (d) SELECT generate_series(1,10);
T3: INSERT INTO test (d) VALUES (1);

Each transaction will take values from the sequence at some point
during evaluation, far from the actual index insertion. That could
result in the following insertion order:

  -- (id, d)
T1: (5, 3)
T2: (10, 1)
T2: (11, 2)
T1: (6, 4)
T3: (1, 1)
... etc

No insert would be optimized here, and I don't think this is an
uncommon case.

So what I propose is that the optimization only needs to check whether
"the rightmost page" is the right page to insert on. And for that
check, all you have to assert is that:

* The page is actually the currently valid rightmost leaf
* The **first** (data) key on that page is strictly less than the scan key

If both are true, the insertion point will be somewhere on that page.
Maybe not at the end, maybe in the middle, but somewhere on that page;
no need to go looking any further in the b-tree.

If the page is empty, the optimization can't be applied, simple as
that. That check is missing from the patch, true enough.
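
Putting that together, the relaxed test would look something like this
(again just a sketch, with the same assumed setup as above; the
empty-page check is the one the current patch lacks):

    OffsetNumber first = P_FIRSTDATAKEY(opaque);

    if (P_RIGHTMOST(opaque) && !P_IGNORE(opaque) &&
        PageGetMaxOffsetNumber(page) >= first &&    /* page not empty */
        _bt_compare(rel, natts, itup_scankey, page, first) > 0)
    {
        /* The scan key is strictly greater than the first data key,
         * so the insertion point is somewhere on this page; do the
         * usual binary search within it. */
    }

The strict ">" (rather than ">=") on the first key is what handles the
equal-to-leftmost-key case mentioned above.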

