Thread: pgsql: Optimize btree insertions for common case of increasing values

pgsql: Optimize btree insertions for common case of increasing values

From
Andrew Dunstan
Date:
Optimize btree insertions for common case of increasing values

Remember the last page of an index insert if it's the rightmost leaf
page. If the next entry belongs on and can fit in the remembered page,
insert the new entry there as long as we can get a lock on the page.
Otherwise, fall back on the more expensive method of searching for
the right place to insert the entry.

This provides a performance improvement for the common case where an
index entry is for monotonically increasing or nearly monotonically
increasing value such as an identity field or a current timestamp.

Pavan Deolasee
Reviewed by Claudio Freire, Simon Riggs and Peter Geoghegan

Discussion: https://postgr.es/m/CABOikdM9DrupjyKZZFM5k8-0RCDs1wk6JzEkg7UgSW6QzOwMZw@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/2b27273435392d1606f0ffc95d73a439a457f08e

Modified Files
--------------
src/backend/access/nbtree/nbtinsert.c  | 138 +++++++++++++++---
src/test/regress/expected/indexing.out | 249 +++++++++++++++++++++++++++++++++
src/test/regress/sql/indexing.sql      | 116 +++++++++++++++
3 files changed, 484 insertions(+), 19 deletions(-)


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Mon, Mar 26, 2018 at 5:10 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Optimize btree insertions for common case of increasing values
>
> Remember the last page of an index insert if it's the rightmost leaf
> page. If the next entry belongs on and can fit in the remembered page,
> insert the new entry there as long as we can get a lock on the page.
> Otherwise, fall back on the more expensive method of searching for
> the right place to insert the entry.

This code can allocate memory in a critical section, since
RelationSetTargetBlock() can call smgropen():

            if (P_ISLEAF(lpageop))
+           {
                xlinfo = XLOG_BTREE_INSERT_LEAF;
+
+               /*
+                * Cache the block information if we just inserted into the
+                * rightmost leaf page of the index.
+                */
+               if (P_RIGHTMOST(lpageop))
+                   RelationSetTargetBlock(rel, BufferGetBlockNumber(buf));
+           }

rd_smgr can be closed by a relcache flush, so it is in fact possible
that we'll reach smgropen().

As I already pointed out, it's unclear why some of these tests are
used, such as P_INCOMPLETE_SPLIT():

+           /*
+            * Check if the page is still the rightmost leaf page, has enough
+            * free space to accommodate the new tuple, no split is in progress
+            * and the scankey is greater than or equal to the first key on the
+            * page.
+            */
+           if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+                   !P_INCOMPLETE_SPLIT(lpageop) &&
+                   !P_IGNORE(lpageop) &&
+                   (PageGetFreeSpace(page) > itemsz) &&
+                   PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
+                   _bt_compare(rel, natts, itup_scankey, page,
+                       P_FIRSTDATAKEY(lpageop)) > 0)

The BTP_INCOMPLETE_SPLIT bit means that the page's right sibling's
downlink is missing, but how can a rightmost page have a right sibling
in the first place? We don't bother to check that when we already know
the page is rightmost within _bt_moveright() or _bt_findinsertloc().

I also noticed that favorable/favourable is misspelled "favourble".
Also, "workout" is a noun.

This patch was committed in haste, and it shows.

-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Pavan Deolasee
Date:


On Wed, Mar 28, 2018 at 11:33 AM, Peter Geoghegan <pg@bowt.ie> wrote:


This code can allocate memory in a critical section, since
RelationSetTargetBlock() can call smgropen():

            if (P_ISLEAF(lpageop))
+           {
                xlinfo = XLOG_BTREE_INSERT_LEAF;
+
+               /*
+                * Cache the block information if we just inserted into the
+                * rightmost leaf page of the index.
+                */
+               if (P_RIGHTMOST(lpageop))
+                   RelationSetTargetBlock(rel, BufferGetBlockNumber(buf));
+           }

rd_smgr can be closed by a relcache flush, so it is in fact possible
that we'll reach smgropen().

I think you're right. I propose that we call RelationSetTargetBlock right after the critical section ends. That should not have any adverse side effect since we're still holding WRITE lock on the page, it's just an hint anyways and the next time we shall do a proper recheck. 
 

As I already pointed out, it's unclear why some of these tests are
used, such as P_INCOMPLETE_SPLIT():

+           /*
+            * Check if the page is still the rightmost leaf page, has enough
+            * free space to accommodate the new tuple, no split is in progress
+            * and the scankey is greater than or equal to the first key on the
+            * page.
+            */
+           if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+                   !P_INCOMPLETE_SPLIT(lpageop) &&
+                   !P_IGNORE(lpageop) &&
+                   (PageGetFreeSpace(page) > itemsz) &&
+                   PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
+                   _bt_compare(rel, natts, itup_scankey, page,
+                       P_FIRSTDATAKEY(lpageop)) > 0)

The BTP_INCOMPLETE_SPLIT bit means that the page's right sibling's
downlink is missing, but how can a rightmost page have a right sibling
in the first place? We don't bother to check that when we already know
the page is rightmost within _bt_moveright() or _bt_findinsertloc().

Hmm. I agree. I am sorry, I missed that comment during the review process. I checked the code again and BTP_INCOMPLETE_SPLIT is always set along with the right-link. So the rightmost page, which does not have a right-link, must not have BTP_INCOMPLETE_SPLIT set. That test is thus redundant (though it should not give any wrong answers).

Previously, we agreed that P_IGNORE() is required. So I assume no issues there. The other tests seem required too for us to conclusively decide to use the cached block.
  

I also noticed that favorable/favourable is misspelled "favourble".
Also, "workout" is a noun.

Fixed in the attached delta.

Please have a look.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Wed, Mar 28, 2018 at 2:14 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>> This code can allocate memory in a critical section, since
>> RelationSetTargetBlock() can call smgropen():

> I think you're right. I propose that we call RelationSetTargetBlock right
> after the critical section ends. That should not have any adverse side
> effect since we're still holding WRITE lock on the page, it's just an hint
> anyways and the next time we shall do a proper recheck.

But why even do it with a write buffer lock held?

The principle behind this patch is that there can only be one
non-ignorable rightmost leaf page in the index, so we can always
detect it if our RelationSetTargetBlock() became stale; this makes it
okay that it can go stale at any time, and that we don't even have a
very weak interlock. That might be fine, but ISTM that calling
RelationSetTargetBlock() with any kind of lock held is misleading. The
hint can become invalidated immediately after the buffer lock is
released anyway, so what's the point?

>> As I already pointed out, it's unclear why some of these tests are
>> used, such as P_INCOMPLETE_SPLIT():

> Previously, we agreed that P_IGNORE() is required. So I assume no issues
> there. The other tests seem required too for us to conclusively decide to
> use the cached block.

Actually, you're right that P_INCOMPLETE_SPLIT() is the only redundant
one. Theoretically, P_IGNORE() should not be required, since page
deletion cannot delete the right most page among an internal page's
children, but actually omitting the P_IGNORE() doesn't seem like an
improvement. That's probably worth a comment, though.

-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Wed, Mar 28, 2018 at 11:56 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Previously, we agreed that P_IGNORE() is required. So I assume no issues
>> there. The other tests seem required too for us to conclusively decide to
>> use the cached block.
>
> Actually, you're right that P_INCOMPLETE_SPLIT() is the only redundant
> one.

Another issue that I have with the main test of the
RelationSetTargetBlock() page is that nobody explains *why* we check
that we have enough space on the page to proceed. Why would it not be
okay if there was a page split?

Although it's subtle, I'm pretty confident that it actually would be
okay from a correctness point of view to allow an insertion to go
ahead that would result in a page split -- it's possible to pass a
NULL stack to _bt_findinsertloc() and _bt_insertonpg() while still
getting correct behavior.

* In the case of _bt_findinsertloc(), it's okay to have a NULL stack
because you'll never take the P_INCOMPLETE_SPLIT() path for a
rightmost page, as already explained, so the stack cannot possibly be
used (besides, even _bt_finish_split() can deal with a NULL stack). It
is not just an accident that it works, because _bt_search() will
sometimes return a NULL stack when writing -- it does so when there is
only one non-meta page, which is both a root and a leaf page. Your
optimization is a bit similar to that, in the sense that the
cached/target block contains a leaf page that is rightmost (root pages
are always rightmost, and are leaf pages in the case described).

* In the case of _bt_insertonpg(), it's okay to have a NULL stack
because there is a codepath that allows _bt_insert_parent() (which is
the only thing in _bt_insertonpg() that touches the stack) to work
without a stack during crash recovery. That path is quite a bit
slower, though.

Suggested next steps to deal with this:

* A minor point: I don't think you should call
RelationSetTargetBlock() when the page P_ISROOT(), which, as I
mentioned, is a condition that can coexist with P_ISLEAF() with very
small B-Trees. There can be no point in doing so. No need to add
P_ISROOT() to the main "Is cached page stale?" test within
_bt_doinsert(), though.

* An assertion would make me feel a lot better about depending on not
having a page split from a significant distance.

Your optimization should continue to not be used when it would result
in a page split, but only because that would be slow. The comments
should say so, IMV. Also, _bt_insertonpg() should have an assertion
against a page split actually occurring when the optimization was
used, just in case. When !BufferIsValid(cbuf), we know that we're
being called from _bt_doinsert() (see existing assertion at the top of
_bt_insertonpg() as an example of this), so at the point where it's
clear a page split is needed, we should assert that there is no target
block that we must have been passed as the target page.

* The indentation of the main _bt_doinsert() test does not follow
project guidelines. Please tweak that, too.

That's all I have for now. I might think of something else tomorrow,
since I haven't spent that much time on this. It would be nice to get
a second opinion on some of this, if at all possible.

-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Simon Riggs
Date:
On 29 March 2018 at 00:09, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Mar 28, 2018 at 11:56 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> Previously, we agreed that P_IGNORE() is required. So I assume no issues
>>> there. The other tests seem required too for us to conclusively decide to
>>> use the cached block.
>>
>> Actually, you're right that P_INCOMPLETE_SPLIT() is the only redundant
>> one.
>
> Another issue that I have with the main test of the
> RelationSetTargetBlock() page is that nobody explains *why* we check
> that we have enough space on the page to proceed. Why would it not be
> okay if there was a page split?

...because we need the Stack to bubble up the parent insert.

> Although it's subtle, I'm pretty confident that it actually would be
> okay from a correctness point of view to allow an insertion to go
> ahead that would result in a page split -- it's possible to pass a
> NULL stack to _bt_findinsertloc() and _bt_insertonpg() while still
> getting correct behavior.
>
> * In the case of _bt_findinsertloc(), it's okay to have a NULL stack
> because you'll never take the P_INCOMPLETE_SPLIT() path for a
> rightmost page, as already explained, so the stack cannot possibly be
> used (besides, even _bt_finish_split() can deal with a NULL stack). It
> is not just an accident that it works, because _bt_search() will
> sometimes return a NULL stack when writing -- it does so when there is
> only one non-meta page, which is both a root and a leaf page. Your
> optimization is a bit similar to that, in the sense that the
> cached/target block contains a leaf page that is rightmost (root pages
> are always rightmost, and are leaf pages in the case described).
>
> * In the case of _bt_insertonpg(), it's okay to have a NULL stack
> because there is a codepath that allows _bt_insert_parent() (which is
> the only thing in _bt_insertonpg() that touches the stack) to work
> without a stack during crash recovery. That path is quite a bit
> slower, though.

It's a linear scan, so quite clearly not going to perform well on big indexes.

Why would that be worth pursuing?

> Suggested next steps to deal with this:
>
> * A minor point: I don't think you should call
> RelationSetTargetBlock() when the page P_ISROOT(), which, as I
> mentioned, is a condition that can coexist with P_ISLEAF() with very
> small B-Trees. There can be no point in doing so. No need to add
> P_ISROOT() to the main "Is cached page stale?" test within
> _bt_doinsert(), though.

True. A better test would be to not use the optimization at all on
smaller btrees by checking the level is >= 3.

> * An assertion would make me feel a lot better about depending on not
> having a page split from a significant distance.
>
> Your optimization should continue to not be used when it would result
> in a page split, but only because that would be slow. The comments
> should say so, IMV. Also, _bt_insertonpg() should have an assertion
> against a page split actually occurring when the optimization was
> used, just in case. When !BufferIsValid(cbuf), we know that we're
> being called from _bt_doinsert() (see existing assertion at the top of
> _bt_insertonpg() as an example of this), so at the point where it's
> clear a page split is needed, we should assert that there is no target
> block that we must have been passed as the target page.

This is the same issues as the NULL stack. The page split would need
to insert into parent.

> * The indentation of the main _bt_doinsert() test does not follow
> project guidelines. Please tweak that, too.
>
> That's all I have for now. I might think of something else tomorrow,
> since I haven't spent that much time on this. It would be nice to get
> a second opinion on some of this, if at all possible.

I don't see any need to change any of these things. This is a tuning
patch and none of the above affects correctness of what has been
committed.

If you can find non-performant cases that require further tuning, we
can submit another patch at a later time.

And then we can cross the bridge of checking whether what is said
above is correct or not.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Thu, Mar 29, 2018 at 7:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Another issue that I have with the main test of the
>> RelationSetTargetBlock() page is that nobody explains *why* we check
>> that we have enough space on the page to proceed. Why would it not be
>> okay if there was a page split?
>
> ...because we need the Stack to bubble up the parent insert.

As I said, strictly speaking we do not.

>> Although it's subtle, I'm pretty confident that it actually would be
>> okay from a correctness point of view to allow an insertion to go
>> ahead that would result in a page split -- it's possible to pass a
>> NULL stack to _bt_findinsertloc() and _bt_insertonpg() while still
>> getting correct behavior.

> It's a linear scan, so quite clearly not going to perform well on big indexes.
>
> Why would that be worth pursuing?

I must have been unclear. I agree that it isn't worth pursuing. My
point is that the fact that that actually works (e.g. it won't
segfault) could mask bugs where the "we must avoid a pagesplit" logic
breaks in the future. Relying on that working from a distance is
something that we can and should verify works out, with an assertion.

>> Suggested next steps to deal with this:
>>
>> * A minor point: I don't think you should call
>> RelationSetTargetBlock() when the page P_ISROOT(), which, as I
>> mentioned, is a condition that can coexist with P_ISLEAF() with very
>> small B-Trees. There can be no point in doing so. No need to add
>> P_ISROOT() to the main "Is cached page stale?" test within
>> _bt_doinsert(), though.
>
> True. A better test would be to not use the optimization at all on
> smaller btrees by checking the level is >= 3.

That might be a better idea, though I would probably go with >= 2. You
can have a reasonably big B-Tree with only one layer of internal pages
between the root and the leaf level. Though that's not worth quibbling
about.

>> * An assertion would make me feel a lot better about depending on not
>> having a page split from a significant distance.
>>
>> Your optimization should continue to not be used when it would result
>> in a page split, but only because that would be slow. The comments
>> should say so, IMV. Also, _bt_insertonpg() should have an assertion
>> against a page split actually occurring when the optimization was
>> used, just in case. When !BufferIsValid(cbuf), we know that we're
>> being called from _bt_doinsert() (see existing assertion at the top of
>> _bt_insertonpg() as an example of this), so at the point where it's
>> clear a page split is needed, we should assert that there is no target
>> block that we must have been passed as the target page.
>
> This is the same issues as the NULL stack. The page split would need
> to insert into parent.

But, to repeat myself, it actually can do that. Try removing the space
check in the main _bt_doinsert() test for yourself -- the regression
tests continue to pass. I would like there to be an assertion failure
instead.

> I don't see any need to change any of these things. This is a tuning
> patch and none of the above affects correctness of what has been
> committed.

I think that there has been a misunderstanding. I'm only asking for a
new assertion when I talk about the stack. That's all.

-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Pavan Deolasee
Date:


On Thu, Mar 29, 2018 at 4:39 AM, Peter Geoghegan <pg@bowt.ie> wrote:


Suggested next steps to deal with this:

* A minor point: I don't think you should call
RelationSetTargetBlock() when the page P_ISROOT(), which, as I
mentioned, is a condition that can coexist with P_ISLEAF() with very
small B-Trees. There can be no point in doing so. No need to add
P_ISROOT() to the main "Is cached page stale?" test within
_bt_doinsert(), though.

Ok. Adding a check for tree height and setting target block only if it's >= 2, as suggested by you and Simon later. Rahila helped me also ran another round of tests and this does not lead to any performance regression (we were worried about whether calling _bt_getrootheight will be expensive).

Also moved RelationSetTargetBlock() to a point where we are not holding any locks and are outside the critical section.
 

* An assertion would make me feel a lot better about depending on not
having a page split from a significant distance.

Ok. I assume you mean an assertion to check that the target page doesn't have an in-complete split. Added that though not sure if it's useful since we concluded that right-most page can't have in-complete split.
 
Let me know if you mean something else.
 
Your optimization should continue to not be used when it would result
in a page split, but only because that would be slow. The comments
should say so, IMV.

Added.
 
Also, _bt_insertonpg() should have an assertion
against a page split actually occurring when the optimization was
used, just in case. When !BufferIsValid(cbuf), we know that we're
being called from _bt_doinsert() (see existing assertion at the top of
_bt_insertonpg() as an example of this), so at the point where it's
clear a page split is needed, we should assert that there is no target
block that we must have been passed as the target page.


You mean passing "fastpath" to _bt_insertonpg and then checking it's false if page split is needed? But isn't page split is only needed if the page doesn't have enough free space? If so, we have checked for that before setting "fastpath".
 
* The indentation of the main _bt_doinsert() test does not follow
project guidelines. Please tweak that, too.

Ok. Fixed.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Wed, Apr 4, 2018 at 5:33 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> Ok. Adding a check for tree height and setting target block only if it's >=
> 2, as suggested by you and Simon later. Rahila helped me also ran another
> round of tests and this does not lead to any performance regression (we were
> worried about whether calling _bt_getrootheight will be expensive).

Right.

> Also moved RelationSetTargetBlock() to a point where we are not holding any
> locks and are outside the critical section.

Right.

>> * An assertion would make me feel a lot better about depending on not
>> having a page split from a significant distance.
>
>
> Ok. I assume you mean an assertion to check that the target page doesn't
> have an in-complete split. Added that though not sure if it's useful since
> we concluded that right-most page can't have in-complete split.
>
> Let me know if you mean something else.

I meant something else. I was talking about the assertion discussed
below. I don't see too much point in the !P_INCOMPLETE_SPLIT(lpageop)
assertion, though.

>> Your optimization should continue to not be used when it would result
>> in a page split, but only because that would be slow. The comments
>> should say so, IMV.
>
>
> Added.

I think the wording here could use some tweaking:

>             /*
> -            * Check if the page is still the rightmost leaf page, has enough
> -            * free space to accommodate the new tuple, no split is in progress
> -            * and the scankey is greater than or equal to the first key on the
> -            * page.
> +            * Check if the page is still the rightmost valid leaf page, has
> +            * enough free space to accommodate the new tuple and the scankey
> +            * is strictly greater than the first key on the page.
> +            *
> +            * NB: We could take the fastpath even when the target block
> +            * doesn't have enough free space (but it's the right-most block)
> +            * since _bt_insertonpg() is capable of working with a NULL stack
> +            * and that's the only additional thing the slow path sets up. But
> +            * we don't optimise for that case because while spliting and
> +            * inserting into the parent without the stack is relatively slow
> +            * operation.
>              */

I would cut this down, and just say "We only insert if it definitely
won't result in a pagesplit" -- no need for the second paragraph in
this high-level routine. The full details can go on top of the new
_bt_insertonpg() assertion I talk about later.

> You mean passing "fastpath" to _bt_insertonpg and then checking it's false
> if page split is needed? But isn't page split is only needed if the page
> doesn't have enough free space? If so, we have checked for that before
> setting "fastpath".

That's not exactly what I meant. I meant that if:

1. This is an insertion to the leaf level in _bt_insertonpg().

and

2. We don't have space on the page, and so must do a split (or at
least free LP_DEAD items).

and

3. RelationGetTargetBlock(rel) != InvalidBlockNumber

There should be an assertion failure. This new assertion within
_bt_insertonpg() makes it much less likely that the assumption breaks
later.

This is where you could point out the low-level details that I
suggested be omitted from _bt_doinsert() at the beginning of this
e-mail. You can mention here that it would actually work without a
pagesplit, but that is only intended for crash recovery, and is a much
slower path that would make the optimization totally
counter-productive. We add an assertion because without one it would
be easy to miss a regression where there is a page split with an empty
stack.

Finally, I'd like to see a small paragraph in the nbtree README, about
the high level theory behind this optimization and page recycling. The
assumption is that there can only be one non-ignorable leaf rightmost
page, and so even a RecentGlobalXmin style interlock isn't required.
We cannot fail to detect that our hint was invalidated, because there
can only be one such page in the B-Tree at any time. It's possible
that the page will be deleted and recycled without a backend's cached
page also being detected as invalidated, but only when we happen to
recycle a page that once again becomes the rightmost leaf page.

Once those changes are made, this should be fine to commit.

-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Pavan Deolasee
Date:


On Wed, Apr 4, 2018 at 10:46 PM, Peter Geoghegan <pg@bowt.ie> wrote:


> You mean passing "fastpath" to _bt_insertonpg and then checking it's false
> if page split is needed? But isn't page split is only needed if the page
> doesn't have enough free space? If so, we have checked for that before
> setting "fastpath".

That's not exactly what I meant. I meant that if:

1. This is an insertion to the leaf level in _bt_insertonpg().

and

2. We don't have space on the page, and so must do a split (or at
least free LP_DEAD items).

and

3. RelationGetTargetBlock(rel) != InvalidBlockNumber

There should be an assertion failure. This new assertion within
_bt_insertonpg() makes it much less likely that the assumption breaks
later.

I came up with something like this:

+               /*
+                * If we're here then a pagesplit is needed. We should never reach here
+                * if we're using the fastpath since we should have checked for all the
+                * required conditions, including the fact that this page has enough
+                * freespace. Note that this routine can in-theory deal with the
+                * situation where a NULL stack pointer is passed (that's what would
+                * happen if the fastpath is taken), like it does during crash
+                * recovery. But that path is much slower, defeating the very purpose
+                * of the optimisation.  The following assertion should protect us from
+                * any future code changes that invalidate those assumptions.
+                *
+                * Note the fact that whenever we fail to take the fastpath, we clear
+                * the cached block. So checking for a valid cached block at this point
+                * is enough to decide whether we're in a fastpath or not.
+                */
+               Assert(!(P_ISLEAF(lpageop) &&
+                               BlockNumberIsValid(RelationGetTargetBlock(rel))));
+
 
But then I started thinking can _bt_insertonpg() be called from a code path which doesn't start at _bt_doinsert()? I traced one such path _bt_getstackbuf() -> _bt_finish_split() -> _bt_insert_parent(), but that can only happen in case of a non-leaf page. The assertion which checks for a leaf page, should be fine, right?
 

Finally, I'd like to see a small paragraph in the nbtree README, about
the high level theory behind this optimization and page recycling. The
assumption is that there can only be one non-ignorable leaf rightmost
page, and so even a RecentGlobalXmin style interlock isn't required.
We cannot fail to detect that our hint was invalidated, because there
can only be one such page in the B-Tree at any time. It's possible
that the page will be deleted and recycled without a backend's cached
page also being detected as invalidated, but only when we happen to
recycle a page that once again becomes the rightmost leaf page.


Can I borrow that almost verbatim, after adding details about the optimisation itself? Looks like I'm too tired to think sensibly.
 
Once those changes are made, this should be fine to commit.

Ok, thanks.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Thu, Apr 5, 2018 at 6:27 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I came up with something like this:
>
> +               /*
> +                * If we're here then a pagesplit is needed. We should never
> reach here
> +                * if we're using the fastpath since we should have checked
> for all the
> +                * required conditions, including the fact that this page
> has enough
> +                * freespace. Note that this routine can in-theory deal with
> the
> +                * situation where a NULL stack pointer is passed (that's
> what would
> +                * happen if the fastpath is taken), like it does during
> crash
> +                * recovery. But that path is much slower, defeating the
> very purpose
> +                * of the optimisation.  The following assertion should
> protect us from
> +                * any future code changes that invalidate those
> assumptions.
> +                *
> +                * Note the fact that whenever we fail to take the fastpath,
> we clear
> +                * the cached block. So checking for a valid cached block at
> this point
> +                * is enough to decide whether we're in a fastpath or not.
> +                */
> +               Assert(!(P_ISLEAF(lpageop) &&
> +
> BlockNumberIsValid(RelationGetTargetBlock(rel))));
> +

This looks good to me.

> But then I started thinking can _bt_insertonpg() be called from a code path
> which doesn't start at _bt_doinsert()? I traced one such path
> _bt_getstackbuf() -> _bt_finish_split() -> _bt_insert_parent(), but that can
> only happen in case of a non-leaf page. The assertion which checks for a
> leaf page, should be fine, right?

Right. That's the whole reason for the P_ISLEAF() part of the
assertion. Actually, since a page in an internal page can only happen
as a direct consequence of a page split in one of its children, you
don't even need the P_ISLEAF() part. I'd leave it in, though, since it
makes it clearer which caller you're thinking about.

>> Finally, I'd like to see a small paragraph in the nbtree README, about
>> the high level theory behind this optimization and page recycling. The
>> assumption is that there can only be one non-ignorable leaf rightmost
>> page, and so even a RecentGlobalXmin style interlock isn't required.
>> We cannot fail to detect that our hint was invalidated, because there
>> can only be one such page in the B-Tree at any time. It's possible
>> that the page will be deleted and recycled without a backend's cached
>> page also being detected as invalidated, but only when we happen to
>> recycle a page that once again becomes the rightmost leaf page.
>>
>
> Can I borrow that almost verbatim, after adding details about the
> optimisation itself? Looks like I'm too tired to think sensibly.

I know the feeling.

I think you can take that wording almost verbatim. Obviously it should
refer to the optimization by name, and blend into the surrounding text
in the README. I suggest putting a small section before "On-the-Fly
Deletion Of Index Tuples", but after the main discussion of deletion +
recycling. It's essentially an exception to the general rule, so that
placement makes sense to me.

Thanks
-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Thu, Apr 5, 2018 at 10:16 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think you can take that wording almost verbatim. Obviously it should
> refer to the optimization by name, and blend into the surrounding text
> in the README. I suggest putting a small section before "On-the-Fly
> Deletion Of Index Tuples", but after the main discussion of deletion +
> recycling. It's essentially an exception to the general rule, so that
> placement makes sense to me.

I also think that we should also say something about extent-based
storage. This optimization relies on the assumption that reading some
stale block cannot read a block from some other relation (which could
perhaps be its own rightmost leaf page). If we ever wanted to share
storage between small relations as extents, that would invalidate the
optimization.

This came up recently on the "PostgreSQL's handling of fsync() errors
is unsafe and risks data loss at least on XFS" thread, and what I
describe is in fact how other database systems manage storage, so this
seems like a real practical consideration.

-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Pavan Deolasee
Date:


On Tue, Apr 10, 2018 at 1:18 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Apr 5, 2018 at 10:16 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think you can take that wording almost verbatim. Obviously it should
> refer to the optimization by name, and blend into the surrounding text
> in the README. I suggest putting a small section before "On-the-Fly
> Deletion Of Index Tuples", but after the main discussion of deletion +
> recycling. It's essentially an exception to the general rule, so that
> placement makes sense to me.

I also think that we should also say something about extent-based
storage. This optimization relies on the assumption that reading some
stale block cannot read a block from some other relation (which could
perhaps be its own rightmost leaf page). If we ever wanted to share
storage between small relations as extents, that would invalidate the
optimization.

This came up recently on the "PostgreSQL's handling of fsync() errors
is unsafe and risks data loss at least on XFS" thread, and what I
describe is in fact how other database systems manage storage, so this
seems like a real practical consideration.


Hmm. I am a bit confused why we want to mention anything about something we're not even considering seriously, let alone any patch or work in that direction. If we at all change everything to use extent based storage, there would be many other things that will break and require changes, no?

 Apart from that, other changes requested are included in the patch. This also takes care of Heikki's observation regarding UNLOGGED tables on the other thread.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Tue, Apr 10, 2018 at 12:07 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Hmm. I am a bit confused why we want to mention anything about something
> we're not even considering seriously, let alone any patch or work in that
> direction. If we at all change everything to use extent based storage, there
> would be many other things that will break and require changes, no?

I guess you're right. It's implied by what you already say in the
nbtree README. No need to do more.

>  Apart from that, other changes requested are included in the patch. This
> also takes care of Heikki's observation regarding UNLOGGED tables on the
> other thread.

Cool.

Feedback on this version:

> +without a backend's cached page also being detected as invalidated, but
> +only when we happen to recycle a page that once again becomes the
> +rightmost leaf page.

I suggest wording this as "...but only when we happen to recycle a
block that once again gets recycled as the rightmost leaf page". (I'm
correcting my own wording here, I think.)

* No need to start a sentence with "So" here, IMV:

> +        * Note the fact that whenever we fail to take the fastpath, we clear
> +        * the cached block. So checking for a valid cached block at this point
> +        * is enough to decide whether we're in a fastpath or not.

* This should test "if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) &&
!P_ISROOT(lpageop))", because you don't want to use the optimization
for internal pages that happen to not be the root -- there is no leaf
check here:

> +       /*
> +        * Cache the block information if we just inserted into the rightmost
> +        * leaf page of the index and it's not the root page.  For very small
> +        * index where root is also the leaf, there is no point trying for any
> +        * optimisation.
> +        */
> +       if (P_RIGHTMOST(lpageop) && !P_ISROOT(lpageop))
> +           cachedBlock = BufferGetBlockNumber(buf);

* I don't think that you should have a #define inline with code like this:

> +#define BTREE_FASTPATH_MIN_LEVEL   2
> +       if (BlockNumberIsValid(cachedBlock) &&
> +           _bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
> +           RelationSetTargetBlock(rel, cachedBlock);

I suggest putting this in nbtree.h instead. You can put it just after
BTREE_NONLEAF_FILLFACTOR, with a comment, in a separate block/section.

Other than that, looks good to me.

Thanks
-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Tue, Apr 10, 2018 at 12:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>  Apart from that, other changes requested are included in the patch. This
>> also takes care of Heikki's observation regarding UNLOGGED tables on the
>> other thread.
>
> Cool.

BTW, Heikki's commit didn't remove the "no split is in progress" part
of the top comment. I think that you can remove that now, but leave in
the separate P_INCOMPLETE_SPLIT() assertion that you've proposed.

-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasingvalues

From
Andrew Dunstan
Date:

On 04/10/2018 03:30 PM, Peter Geoghegan wrote:
> On Tue, Apr 10, 2018 at 12:07 PM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>> Hmm. I am a bit confused why we want to mention anything about something
>> we're not even considering seriously, let alone any patch or work in that
>> direction. If we at all change everything to use extent based storage, there
>> would be many other things that will break and require changes, no?
> I guess you're right. It's implied by what you already say in the
> nbtree README. No need to do more.
>
>>  Apart from that, other changes requested are included in the patch. This
>> also takes care of Heikki's observation regarding UNLOGGED tables on the
>> other thread.
> Cool.
>
> Feedback on this version:
>
>> +without a backend's cached page also being detected as invalidated, but
>> +only when we happen to recycle a page that once again becomes the
>> +rightmost leaf page.
> I suggest wording this as "...but only when we happen to recycle a
> block that once again gets recycled as the rightmost leaf page". (I'm
> correcting my own wording here, I think.)
>
> * No need to start a sentence with "So" here, IMV:
>
>> +        * Note the fact that whenever we fail to take the fastpath, we clear
>> +        * the cached block. So checking for a valid cached block at this point
>> +        * is enough to decide whether we're in a fastpath or not.
> * This should test "if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) &&
> !P_ISROOT(lpageop))", because you don't want to use the optimization
> for internal pages that happen to not be the root -- there is no leaf
> check here:
>
>> +       /*
>> +        * Cache the block information if we just inserted into the rightmost
>> +        * leaf page of the index and it's not the root page.  For very small
>> +        * index where root is also the leaf, there is no point trying for any
>> +        * optimisation.
>> +        */
>> +       if (P_RIGHTMOST(lpageop) && !P_ISROOT(lpageop))
>> +           cachedBlock = BufferGetBlockNumber(buf);
> * I don't think that you should have a #define inline with code like this:
>
>> +#define BTREE_FASTPATH_MIN_LEVEL   2
>> +       if (BlockNumberIsValid(cachedBlock) &&
>> +           _bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
>> +           RelationSetTargetBlock(rel, cachedBlock);
> I suggest putting this in nbtree.h instead. You can put it just after
> BTREE_NONLEAF_FILLFACTOR, with a comment, in a separate block/section.
>
> Other than that, looks good to me.
>


Committed with light editing. I didn't put the #define in the .h file -
it's only used here and that seemed like unnecessary clutter. I moved it
to the top of the file. I also standardized the spelling of "optimization".

cheers

andrew



Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Tue, Apr 10, 2018 at 3:27 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Committed with light editing. I didn't put the #define in the .h file -
> it's only used here and that seemed like unnecessary clutter. I moved it
> to the top of the file. I also standardized the spelling of "optimization".

The comments still say "Check if the page...no split is in progress".
Despite the fact that that's just an assertion now.

-- 
Peter Geoghegan


Re: pgsql: Optimize btree insertions for common case of increasingvalues

From
Andrew Dunstan
Date:

On 04/10/2018 06:33 PM, Peter Geoghegan wrote:
> On Tue, Apr 10, 2018 at 3:27 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Committed with light editing. I didn't put the #define in the .h file -
>> it's only used here and that seemed like unnecessary clutter. I moved it
>> to the top of the file. I also standardized the spelling of "optimization".
> The comments still say "Check if the page...no split is in progress".
> Despite the fact that that's just an assertion now.
>

Fixed.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Optimize btree insertions for common case of increasing values

From
Peter Geoghegan
Date:
On Tue, Apr 10, 2018 at 3:37 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
>> The comments still say "Check if the page...no split is in progress".
>> Despite the fact that that's just an assertion now.
>>
>
> Fixed.

Thanks.

-- 
Peter Geoghegan