Race condition within _bt_findinsertloc()? (new page split code)

From
Peter Geoghegan
Date:
While speccing out a new B-Tree verification tool, I had the
opportunity to revisit a thought I had during review concerning
_bt_findinsertloc(): that the new coding is unsafe in the event of
deferred split completion of a leaf page of a unique index. To recap,
we now do the following in _bt_findinsertloc():

    /*
     * If this page was incompletely split, finish the split now. We
     * do this while holding a lock on the left sibling, which is not
     * good because finishing the split could be a fairly lengthy
     * operation.  But this should happen very seldom.
     */
    if (P_INCOMPLETE_SPLIT(lpageop))
    {
        _bt_finish_split(rel, rbuf, stack);
        rbuf = InvalidBuffer;
        continue;
    }

The "left sibling" referred to here is "the first page this key could
be on", an important concept for unique index enforcement. It's the
first sibling iff we're on our first iteration of the nested for(;;)
loop in _bt_findinsertloc(). So the buffer lock held on this left
sibling may constitute what in the past I've called a "value lock";
we've established the right to insert our value into the unique index
at this point, and the lock will only be released when we're done
(regardless of whether or not that buffer/page value lock is on the
buffer/page we'll ultimately insert into, or an earlier one).

Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked throughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:

* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.

* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a "value lock" in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original "value
lock" throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original "value lock". A physical insertion has yet to occur.

* A third inserter of the value 7 comes along. It gets to a later page
than the one locked by the second inserter, preferring the newer
downlink with value 8 (the internal-page _bt_binsrch() logic ensures
this). It exclusive locks that later page/buffer before the second guy
gets a chance to lock it once again. It establishes the right to
insert with _bt_check_unique(), undeterred by the second inserter's
buffer lock/"value lock". The value lock is effectively skipped over.

* Both the second and third inserters have "established the right" to
insert the same value, 7, and both do so. The unique index has an
MVCC-snapshot-wise spurious duplicate, and so is corrupt.

Regardless of whether or not I have these details exactly right (that
is, regardless of whether or not this scenario is strictly possible) I
suggest as a code-hardening measure that _bt_findinsertloc() release
its "value lock", upon realizing it must complete splits, and then
complete the split or splits known to be required. It would finally
report that it "couldn't find" an insertion location to
_bt_doinsert(), which would then retry from the start, just like when
_bt_check_unique() finds an inconclusive conflict. The only difference
is that we don't have an xact to wait on. We haven't actually done
anything extra that makes this later "goto top;" any different to the
existing one.

This should occur so infrequently that it isn't worth trying harder,
or worth differentiating between the UNIQUE_CHECK_NO and
!UNIQUE_CHECK_NO cases when retrying. This also removes the more
general risk of holding an extra buffer lock during page split
completion.

It kind of looks like _bt_findinsertloc() doesn't have this bug,
because in my scenario _bt_finish_split() is called with both the
value lock and its right page locked (so the right page is the left
page for _bt_finish_split()'s purposes). But take another look:
_bt_finish_split() releases its locks, so once again only the
original value lock is held when _bt_finish_split() returns, and the
new downlink is there to let a descent skip the value-locked page.
So the bug does exist (assuming that I haven't failed to consider
some third factor and am not otherwise mistaken).

Thoughts?
-- 
Peter Geoghegan



Re: Race condition within _bt_findinsertloc()? (new page split code)

From
Heikki Linnakangas
Date:
On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
> While speccing out a new B-Tree verification tool, I had the
> opportunity to revisit a thought I had during review concerning
> _bt_findinsertloc(): that the new coding is unsafe in the event of
> deferred split completion of a leaf page of a unique index. To recap,
> we now do the following in _bt_findinsertloc():
>
> /*
>   * If this page was incompletely split, finish the split now. We
>   * do this while holding a lock on the left sibling, which is not
>   * good because finishing the split could be a fairly lengthy
>   * operation.  But this should happen very seldom.
>   */
> if (P_INCOMPLETE_SPLIT(lpageop))
> {
>      _bt_finish_split(rel, rbuf, stack);
>      rbuf = InvalidBuffer;
>      continue;
> }
>
> The "left sibling" referred to here is "the first page this key could
> be on", an important concept for unique index enforcement.

No, it's not "the first page this key could be on".

_bt_findinsertloc() does *not* hold a lock on the first valid page the 
key could go to. It merely ensures that when it steps to the next page, 
it releases the lock on the previous page only after acquiring the lock 
on the next page. Throughout the operation, it will hold a lock on 
*some* page that could legally hold the inserted value, and it acquires 
the locks in left-to-right order. This is sufficient for the uniqueness 
checks, because _bt_check_unique() scans all the pages, and 
_bt_check_unique() *does* hold the first page locked while it scans the 
rest of the pages. But _bt_findinsertloc() does not.

Note that _bt_moveright() also finishes any incomplete splits it 
encounters (when called for an insertion). So _bt_findinsertloc() never 
gets called on a page with the incomplete-split flag set. It might 
encounter one when it moves right, but never on the first page.

> Anyway, the concern is that there may be problems when we call
> _bt_finish_split() with that left sibling locked throughout (i.e.
> finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
> itself has a right sibling from the incomplete split (which is usually
> the value lock page's right-right sibling)). I'm not concerned about
> performance, since as the comment says it ought to be an infrequent
> occurrence. I also believe that there are no deadlock hazards. But
> consider this scenario:
>
> * We insert the value 7 into an int4 unique index. We need to split
> the leaf page. We run out of memory or something, and ours is an
> incomplete page split. Our transaction is aborted. For the sake of
> argument, suppose that there are also already a bunch of dead tuples
> within the index with values of 7, 8 and 9.

If I understood correctly, the tree looks like this before the insertion:

Parent page:
+-------------+
|             |
| 9 -> A      |
+-------------+

Leaf A
+-------------+
| HI-key: 9   |
|             |
| 7 8 9       |
+-------------+

And after insertion and incomplete split:

Parent page
+-------------+
|             |
| 9 -> A      |
+-------------+

Leaf A              Leaf B
+--------------+     +-------------+
| HI-key: 8    |     | HI-key: 9   |
| (INCOMPLETE_ |     |             |
| SPLIT)       | <-> |             |
|              |     |             |
|  7   7*   8  |     |   9         |
+--------------+     +-------------+

where 7* is the newly inserted key, with value 7.

(you didn't mention at what point the split happens, but in the next 
paragraph you said the new downlink has value 8, which implies the above 
split)

> * Another inserter of the value 7 comes along. It follows exactly the
> same downlink as the first, now aborted inserter (suppose the
> downlink's value is 9). It also locks the same leaf page to establish
> a "value lock" in precisely the same manner. Finding no room on the
> first page, it looks further right, maintaining its original "value
> lock" throughout. It finishes the first inserter's split of the
> non-value-lock page - a new downlink is inserted into the parent page,
> with the value 8. It then releases all buffer locks except the first
> one/original "value lock". A physical insertion has yet to occur.

Hmm, I think you got confused at this step. When inserting a 7, you 
cannot "look further right" to find a page with more space, because the 
HI-key, 8, on the first page stipulates that 7 must go on that page, not 
some later page.

> * A third inserter of the value comes along. It gets to a later page
> than the one locked by the second inserter, preferring the newer
> downlink with value 8 (the internal-page _bt_binsrch() logic ensures
> this).

That's a contradiction: the downlink with value 8 points to the first 
page, not some later page. After the split is finished, the tree looks 
like this:

Parent page
+-------------+
| 8 -> A      |
| 9 -> B      |
+-------------+

Leaf A              Leaf B
+-------------+     +-------------+
| HI-key: 8   |     | HI-key: 9   |
|             | <-> |             |
|  7   7*  8  |     |   9         |
+-------------+     +-------------+

> Regardless of whether or not I have these details exactly right (that
> is, regardless of whether or not this scenario is strictly possible) I
> suggest as a code-hardening measure that _bt_findinsertloc() release
> its "value lock", upon realizing it must complete splits, and then
> complete the split or splits known to be required. It would finally
> report that it "couldn't find" an insertion location to
> _bt_doinsert(), which would then retry from the start, just like when
> _bt_check_unique() finds an inconclusive conflict. The only difference
> is that we don't have an xact to wait on. We haven't actually done
> anything extra that makes this later "goto top;" any different to the
> existing one.
>
> This should occur so infrequently that it isn't worth trying harder,
> or worth differentiating between the UNIQUE_CHECK_NO and
> !UNIQUE_CHECK_NO cases when retrying. This also removes the more
> general risk of holding an extra buffer lock during page split
> completion.

Yeah, that would work too. It seems safe enough as it is, though, so I 
don't see the point.

> It kind of looks like _bt_findinsertloc() doesn't have this bug,
> because in my scenario _bt_finish_split() is called with both the
> value lock and its right page locked (so the right page is the left
> page for _bt_finish_split()'s purposes). But when you take another
> look, and realize that _bt_finish_split() releases its locks, and so
> once again only the original value lock will be held when
> _bt_finish_split() returns, and so the downlink is there to skip the
> value locked page, you realize that the bug does exist (assuming that
> I haven't failed to consider some third factor and am not otherwise
> mistaken).

When inserting, the scan for the right insert location always begins 
from the first page where the key can legally go to. Inserting a missing 
downlink doesn't change what that page is - it just makes it faster to 
find, by reducing the number of right-links you need to follow.

PS. Thanks for looking into this again! These B-tree changes really need 
thorough review.

- Heikki



Re: Race condition within _bt_findinsertloc()? (new page split code)

From
Peter Geoghegan
Date:
On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> The "left sibling" referred to here is "the first page this key could
>> be on", an important concept for unique index enforcement.
>
> No, it's not "the first page this key could be on".

Well, it may be initially. I could have been more cautious about the
terminology here.

> Also note that _bt_moveright() also finishes any incomplete splits it
> encounters (when called for an insertion). So _bt_findinsertloc() never gets
> called on a page with the incomplete-split flag set. It might encounter one
> when it moves right, but never the first page.

Fair enough, but I don't think that affects correctness either way (I
don't think that you meant to imply that this was a necessary
precaution that you'd taken - right?). It's a nice property, since it
makes the extra locking while completing a split within
_bt_findinsertloc() particularly infrequent. But, that might also be a
bad thing, when considered from a different perspective.

> If I understood correctly, the tree looks like this before the insertion:
>
> Parent page:
> +-------------+
> |             |
> | 9 -> A      |
> +-------------+
>
> Leaf A
> +-------------+
> | HI-key: 9   |
> |             |
> | 7 8 9       |
> +-------------+
>
> And after insertion and incomplete split:
>
> Parent page
> +-------------+
> |             |
> | 9 -> A      |
> +-------------+
>
> Leaf A              Leaf B
> +--------------+     +-------------+
> | HI-key: 8    |     | HI-key: 9   |
> | (INCOMPLETE_ |     |             |
> | SPLIT)       | <-> |             |
> |              |     |             |
> |  7   7*   8  |     |   9         |
> +--------------+     +-------------+

> After the split is finished, the tree looks like this:
>
> Parent page
> +-------------+
> | 8 -> A      |
> | 9 -> B      |
> +-------------+
>
> Leaf A              Leaf B
> +-------------+     +-------------+
> | HI-key: 8   |     | HI-key: 9   |
> |             | <-> |             |
> |  7   7*  8  |     |   9         |
> +-------------+     +-------------+

How did the parent page change between before and after the final
atomic operation (page split completion)? What happened to "9 -> A"?

>> Regardless of whether or not I have these details exactly right (that
>> is, regardless of whether or not this scenario is strictly possible) I
>> suggest as a code-hardening measure that _bt_findinsertloc() release
>> its "value lock", upon realizing it must complete splits, and then
>> complete the split or splits known to be required. It would finally
>> report that it "couldn't find" an insertion location to
>> _bt_doinsert(), which would then retry from the start, just like when
>> _bt_check_unique() finds an inconclusive conflict.
>
> Yeah, that would work too. It seems safe enough as it is, though, so I don't
> see the point.

Well, it would be nice to not have to finish the page split in what is
a particularly critical path, with that extra buffer lock. It's not
strictly necessary, but then it is theoretically safer, and certainly
much clearer. The fact that this code is so seldom executed is one
issue that made me revisit this. On the other hand, I can see why
you'd want to avoid cluttering up the relatively comprehensible
_bt_doinsert() function if it could be avoided. I defer to you.

> PS. Thanks for looking into this again! These B-tree changes really need
> thorough review.

You're welcome. Hopefully my questions will lead you in a useful
direction, even if my concerns turn out to be, in the main, unfounded. :-)

It previously wasn't in evidence that you'd considered these
interactions, and I feel better knowing that you have.
-- 
Peter Geoghegan



Re: Race condition within _bt_findinsertloc()? (new page split code)

From
Heikki Linnakangas
Date:
On 05/27/2014 09:47 PM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Also note that _bt_moveright() also finishes any incomplete splits it
>> encounters (when called for an insertion). So _bt_findinsertloc() never gets
>> called on a page with the incomplete-split flag set. It might encounter one
>> when it moves right, but never the first page.
>
> Fair enough, but I don't think that affects correctness either way (I
> don't think that you meant to imply that this was a necessary
> precaution that you'd taken - right?).

Right.

>> If I understood correctly, the tree looks like this before the insertion:
>>
>> Parent page:
>> +-------------+
>> |             |
>> | 9 -> A      |
>> +-------------+
>>
>> Leaf A
>> +-------------+
>> | HI-key: 9   |
>> |             |
>> | 7 8 9       |
>> +-------------+
>>
>> And after insertion and incomplete split:
>>
>> Parent page
>> +-------------+
>> |             |
>> | 9 -> A      |
>> +-------------+
>>
>> Leaf A              Leaf B
>> +--------------+     +-------------+
>> | HI-key: 8    |     | HI-key: 9   |
>> | (INCOMPLETE_ |     |             |
>> | SPLIT)       | <-> |             |
>> |              |     |             |
>> |  7   7*   8  |     |   9         |
>> +--------------+     +-------------+
>
>> After the split is finished, the tree looks like this:
>>
>> Parent page
>> +-------------+
>> | 8 -> A      |
>> | 9 -> B      |
>> +-------------+
>>
>> Leaf A              Leaf B
>> +-------------+     +-------------+
>> | HI-key: 8   |     | HI-key: 9   |
>> |             | <-> |             |
>> |  7   7*  8  |     |   9         |
>> +-------------+     +-------------+
>
> How did the parent page change between before and after the final
> atomic operation (page split completion)? What happened to "9 -> A"?

Ah, sorry, I got that wrong. The downlinks store the *low* key of the 
child page, not the high key as I depicted. Let me try again:

On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
> Anyway, the concern is that there may be problems when we call
> _bt_finish_split() with that left sibling locked throughout (i.e.
> finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
> itself has a right sibling from the incomplete split (which is usually
> the value lock page's right-right sibling)). I'm not concerned about
> performance, since as the comment says it ought to be an infrequent
> occurrence. I also believe that there are no deadlock hazards. But
> consider this scenario:
>
> * We insert the value 7 into an int4 unique index. We need to split
> the leaf page. We run out of memory or something, and ours is an
> incomplete page split. Our transaction is aborted. For the sake of
> argument, suppose that there are also already a bunch of dead tuples
> within the index with values of 7, 8 and 9.

So, initially the tree looks like this:

Parent page:
+-------------+
|             |
| 7 -> A      |
+-------------+

Leaf A
+-------------+
| HI-key: 9   |
|             |
| 7 8 9       |
+-------------+

And after insertion and incomplete split:

Parent page
+-------------+
|             |
| 7 -> A      |
+-------------+

Leaf A              Leaf B
+--------------+     +-------------+
| HI-key: 8    |     | HI-key: 9   |
| (INCOMPLETE_ |     |             |
| SPLIT)       | <-> |             |
|              |     |             |
|  7   7*  8   |     |    9        |
+--------------+     +-------------+

where 7* is the newly inserted key, with value 7.

(you didn't mention at what point the split happens, but in the next
paragraph you said the new downlink has value 8, which implies the above 
split)

> * Another inserter of the value 7 comes along. It follows exactly the
> same downlink as the first, now aborted inserter (suppose the
> downlink's value is 9). It also locks the same leaf page to establish
> a "value lock" in precisely the same manner. Finding no room on the
> first page, it looks further right, maintaining its original "value
> lock" throughout. It finishes the first inserter's split of the
> non-value-lock page - a new downlink is inserted into the parent page,
> with the value 8. It then releases all buffer locks except the first
> one/original "value lock". A physical insertion has yet to occur.

The downlink of the original page cannot contain 9. Because, as I now 
remember ;-), the downlinks contain low-keys, not high keys.

- Heikki



Re: Race condition within _bt_findinsertloc()? (new page split code)

From
Peter Geoghegan
Date:
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
> page, not the high key as I depicted. Let me try again:

Would you mind humoring me, and including a corrected final
post-downlink-insert diagram, when the split is fully complete? You
omitted that.

-- 
Peter Geoghegan



Re: Race condition within _bt_findinsertloc()? (new page split code)

From
Heikki Linnakangas
Date:
On 05/27/2014 11:30 PM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
>> page, not the high key as I depicted. Let me try again:
>
> Would you mind humoring me, and including a corrected final
> post-downlink-insert diagram, when the split is fully complete? You
> omitted that.

Sure:

Parent page
+-------------+
| 7 -> A      |
| 8 -> B      |
+-------------+

Leaf A              Leaf B
+--------------+     +-------------+
| HI-key: 8    |     | HI-key: 9   |
|              |     |             |
|              | <-> |             |
|              |     |             |
|  7   7*  8   |     |    9        |
+--------------+     +-------------+

- Heikki



Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:
<div dir="ltr">Claudio Freire and I are proposing new functionality for Postgresql <br />to extend the scope of
prefetchingand also exploit posix asynchronous IO<br />when doing prefetching,    and have a patch based on 9.4dev<br
/>readyfor consideration.<br /><br />This topic has cropped up at irregular intervals over the years,<br />e.g. this
threadback in 2012<br />   <a
href="www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com"
target="_blank">www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com</a><br
/>andthis thread more recently<br />  
http://www.postgresql.org/message-id/CAGTBQpaFC_z=zdWVAXD8wWss3v6jxZ5pNmrrYPsD23LbrqGvgQ@mail.gmail.com<br/><br />We
nowhave an implementation which gives useful performance improvement<br />as well as other advantages compared to what
iscurrently available,<br />at least for certain environments.<br /><br />Below I am pasting the README we have written
forthis new functionality<br />which mentions some of the measurements, advantages (and disadvantages)<br />and we
welcomeall and any comments on this.<br /><br />I will send the patch to commitfest later, once this email is posted to
hackers,<br/>so that anyone who wishes can try it,  or apply directly to me if you wish.<br />The patch is currently
basedon 9.4dev but a version based on 9.3.4<br />will be available soon if anyone wants that.    The patch is large 
(43files)<br />so non-trivial to review,   but any comments on it (when posted) will be<br />appreciated and acted
on.   Note that at present the only environment<br />in which it has been applied and tested is linux.<br /><br />John
Lumby   <br />__________________________________________________________________________________________________<br
/><br/><br />Postgresql  --   Extended Prefetching using Asynchronous IO<br
/>============================================================<br/><br />Postgresql currently (9.3.4) provides a
limitedprefetching capability<br />using posix_fadvise to give hints to the Operating System kernel<br />about which
pagesit expects to read in the near future.<br />This capability is used only during the heap-scan phase of
bitmap-indexscans.<br />It is controlled via the effective_io_concurrency configuration parameter.<br /><br />This
capabilityis now extended in two ways :<br />   .   use asynchronous IO into Postgresql shared buffers as an<br
/>      alternative to posix_fadvise<br />   .   Implement prefetching in other types of scan :<br />            . 
non-bitmap(i.e. simple) index scans - index pages<br />                     currently only for B-tree indexes.<br
/>                   (developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)<br />            .  non-bitmap
(i.e.simple) index scans - heap pages<br />                          currently only for B-tree indexes.<br
/>           .  simple heap scans<br /><br />Posix asynchronous IO is chosen as the function library for asynchronous
IO,<br/>since this is well supported and also fits very well with the model of<br />the prefetching process, 
particularlyas regards checking for completion<br />of an asynchronous read.    On linux,   Posix asynchronous IO is
provided<br/>in the librt library.    librt uses independently-schedulable threads to<br />achieve the
asynchronicity,  rather than kernel functionality.<br /><br />In this implementation,  use of asynchronous IO is
limitedto prefetching<br />while performing one of the three types of scan<br />            .  B-tree bitmap index scan
-heap pages    (as already exists)<br />            .  B-tree non-bitmap (i.e. simple) index scans - index and heap
pages<br/>            .  simple heap scans<br />on permanent relations.   It is not used on temporary tables nor for
writes.<br/><br />The advantages of Posix asynchronous IO into shared buffers<br />compared to posix_fadvise are :<br
/>  .   Beneficial for non-sequential access patterns as well as sequential<br />   .   No restriction on the kinds of
IOwhich can be used<br />       (other kinds of asynchronous IO impose restrictions such as<br />        buffer
alignment, use of non-buffered IO).<br />   .   Does not interfere with standard linux kernel read-ahead
functionality.<br/>       (It has been stated in <br
/> www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com<br/>       that
:<br/>          "the kernel stops doing read-ahead when a call to posix_fadvise comes.<br />           I noticed the
performancehit, and checked the kernel's code.<br />           It effectively changes the prediction mode from
sequentialto fadvise,<br />           negating the (assumed) kernel's prefetch logic")<br />   .   When the read
requestis issued after a prefetch has completed,<br />       no delay associated with a kernel call to copy the page
from<br/>       kernel page buffers into the Postgresql shared buffer,<br />       since it is already there.<br
/>      Also,   in a memory-constrained environment,   there is a greater<br />       probability that the prefetched
pagewill "stick" in memory<br />       since the linux kernel victimizes the filesystem page cache in preference<br
/>      to swapping out user process pages.<br />   .   Statistics on prefetch success can be gathered (see
"Statistics"below)<br />       which helps the administrator to tune the prefetching settings.<br /><br />These
benefitsare most likely to be obtained in a system whose usage profile<br />(e.g. from iostat)  shows:<br />     .  
highIO wait from mostly-read activity<br />     .   disk access pattern is not entirely sequential<br />         (so
kernelreadahead can't predict it but postgresql can)<br />     .   sufficient spare idle CPU to run the librt
pthreads<br/>         or,  stated another way,    the CPU subsystem is relatively powerful<br />         compared to
thedisk subsystem.<br />In such ideal conditions,  and with a workload with plenty of index scans,<br />around 10% -
20%improvement in throughput has been achieved.<br />In an admittedly extreme environment measured by this author,   
witha workload<br />consisting of 8 client applications each running similar complex queries<br />(same query structure
butdifferent predicates and constants),<br />including 2 Bitmap Index Scans and 17 non-bitmap index scans,<br />on a
dual-coreIntel laptop (4 hyperthreads) with the database on a single<br />USB3-attached 500GB disk drive, and no part
ofthe database in filesystem buffers<br />initially,  (filesystem freshly mounted),  comparing unpatched build<br
/>usingposix_fadvise with effective_io_concurrency 4 against same build patched<br />with async IO and
effective_io_concurrency4 and max_async_io_prefetchers 32,<br />elapse time repeatably improved from around 640-670
secondsto around 530-550 seconds,<br />a 17% - 18% improvement. <br /><br />The disadvantages of Posix asynchronous IO
comparedto posix_fadvise are:<br />     .   probably higher CPU utilization:<br />         Firstly, the extra work
performedby the librt threads adds CPU<br />         overhead, and secondly, if the asynchronous prefetching is
effective,<br/>         then it will deliver better (greater) overlap of CPU with IO, which<br />         will reduce
elapsedtimes and hence increase CPU utilization percentage<br />         still more (during that shorter elapsed
time).<br/>     .   more context switching,  because of the additional threads.<br /><br /><br />Statistics:<br
/>___________<br/><br />A number of additional statistics relating to effectiveness of asynchronous IO<br />are
provided as an extension of the existing pg_stat_statements loadable module.
Refer to the appendix "Additional Supplied Modules" in the current
PostgreSQL Documentation for details of this module.

The following additional statistics are provided for asynchronous IO prefetching:

    . aio_read_noneed  :   number of prefetches for which no need for prefetch as block already in buffer pool
    . aio_read_discrd  :   number of prefetches for which buffer not subsequently read and therefore discarded
    . aio_read_forgot  :   number of prefetches for which buffer not subsequently read and then forgotten about
    . aio_read_noblok  :   number of prefetches for which no available BufferAiocb control block
    . aio_read_failed  :   number of aio reads for which aio itself failed or the read failed with an errno
    . aio_read_wasted  :   number of aio reads for which in-progress aio cancelled and disk block not used
    . aio_read_waited  :   number of aio reads for which disk block used but had to wait for it
    . aio_read_ontime  :   number of aio reads for which disk block used and ready on time when requested

Some of these are (hopefully) self-explanatory.  Some additional notes:

    . aio_read_discrd and aio_read_forgot :
            prefetch was wasted work since the buffer was not subsequently read.
            The discrd case indicates that the scanner realized this and discarded the buffer,
            whereas the forgot case indicates that the scanner did not realize it,
            which should not normally occur.
            A high number in either suggests lowering effective_io_concurrency.

    . aio_read_noblok :
            Any significant number in relation to all the other numbers indicates that
            max_async_io_prefetchers should be increased.

    . aio_read_waited :
            The page was prefetched but the asynchronous read had not completed by the time the
            scanner requested to read it.  This causes extra overhead in waiting and indicates that
            prefetching is not providing much if any benefit.
            The disk subsystem may be underpowered/overloaded in relation to the available CPU power.

    . aio_read_ontime :
            The page was prefetched and the asynchronous read had completed by the time the
            scanner requested to read it.  Optimal behaviour.  If this number is large
            in relation to all the other numbers except (possibly) aio_read_noneed,
            then prefetching is working well.

To create the extension with support for these additional statistics, use the following syntax:
     CREATE EXTENSION pg_stat_statements VERSION '1.3'
or, if you run the new code against an existing database which already has the extension
(see installation and migration below), you can
     ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'

A suggested set of commands for displaying these statistics might be:

 /*  OPTIONALLY */  DROP extension pg_stat_statements;
                    CREATE extension pg_stat_statements VERSION '1.3';
 /*  run your workload  */
                    select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time \
                       , aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot \
                       from pg_stat_statements where shared_blks_read > 0;
Installation and Build Configuration:
_____________________________________

1. First - a prerequisite:
#  as well as requiring all the usual package build tools such as gcc, make etc,
#  as described in the instructions for building postgresql,
#  the following is required :
    gnu autoconf at version 2.69
# run the following command
autoconf -V
# it *must* return
autoconf (GNU Autoconf) 2.69

2. If you don't have it or it is a different version,
then you must obtain version 2.69 (which is the current version)
from your distribution provider or from the gnu software download site.

3. Also you must have the source tree for postgresql version 9.4 (development version).
#   all the following commands assume your current working directory is the top of the source tree.

4. cd to top of source tree :
#   check it appears to be a postgresql source tree
ls -ld configure.in src
#   should show both the file and the directory
grep PostgreSQL COPYRIGHT
#   should show PostgreSQL Database Management System

5. Apply the patch :
patch -b -p0 -i <patch_file_path>
#   should report no errors, 42 files patched (see list at bottom of this README)
#   and all hunks applied
#   check the patch was applied to configure.in
ls -ld configure.in.orig configure.in
#   should show both files

6. Rebuild the configure script with the patched configure.in :
mv configure configure.orig;
autoconf configure.in >configure; echo "rc= $? from autoconf"; chmod +x configure;
ls -lrt configure.orig configure;

7. run the new configure script :
#   if you have run configure before,
#   then you may first want to save existing config.status and config.log if they exist,
#   and then specify the same configure flags and options as you specified before.
#   the patch does not alter or extend the set of configure options
#   if unsure,  run ./configure --help
#   if still unsure,  run ./configure
./configure <other configure options as desired>

8. now check that configure decided that this environment supports asynchronous IO :
grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
#  it should show
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
#  if not,  apparently your environment does not support asynch IO -
#  the config.log will show how it came to that conclusion,
#  also check for :
#    . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
#    . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
#  do not proceed without this define being set.
9. do you want to use the new code on an existing cluster
   that was created using the same code base but without the patch?
   If so then run this nasty-looking command :
   (cut-and-paste it into a terminal window or a shell-script file)
   Otherwise continue to step 10.
   see Migration note below for explanation.
###############################################################################################
   fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx < 200 ]]; do {
       bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
   }; done;
   if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
   elif [[ -a $fl ]]; then {
       mv $fl $bkfl && {
          sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
          str="diff -w $bkfl $fl"; echo "$str"; eval "$str";
       };
   };
   else echo "ooopppss $fl is missing";
   fi;
###############################################################################################
# it should report something like
diff -w Makefile.global.bak0 Makefile.global
222c222
< CFLAGS = XXXX
---
> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
#   where XXXX is some set of flags

10. now run the rest of the build process as usual -
    follow instructions in file INSTALL if that file exists,
    else e.g. run
make && make install

If the build fails with the following error:
undefined reference to `aio_init'
Then edit the following file
src/include/pg_config_manual.h
and add the following line at the bottom:

#define DONT_HAVE_AIO_INIT

and then run
make clean && make && make install
See notes to section Runtime Configuration below for more information on this.



Migration, Runtime Configuration, and Use:
___________________________________________


Database Migration:
___________________
The new prefetching code for non-bitmap index scans introduces a new btree-index
function named btpeeknexttuple.  The correct way to add such a function involves
also adding it to the catalog as an internal function in pg_proc.
However, this results in the new built code considering an existing database to be
incompatible, i.e. requiring backup on the old code and restore on the new.
This is normal behaviour for migration to a new version of postgresql, and is
also a valid way of migrating a database for use with this asynchronous IO feature,
but in this case it may be inconvenient.

As an alternative, the new code may be compiled with the macro define
AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
which does what it says by not altering the catalog.  The patched build can then
be run against an existing database cluster initdb'd using the unpatched build.

There are no known ill-effects of so doing, but :
     .  in any case, it is strongly suggested to make a backup of any precious database
        before accessing it with a patched build
     .  be aware that if this asynchronous IO feature is eventually released as part of postgresql,
        migration will probably be required anyway.

This option to avoid catalog migration is intended as a convenience for a quick test,
and also makes it easier to obtain performance comparisons on the same database.



Runtime Configuration:
______________________
One new configuration parameter settable in postgresql.conf and
in any other way as described in the postgresql documentation :

max_async_io_prefetchers
  Maximum number of background processes concurrently using asynchronous
  librt threads to prefetch pages into shared memory buffers

This number can be thought of as the maximum number of librt threads
concurrently active, each working on a list of from 1 to
target_prefetch_pages pages ( see notes 1 and 2 ).

In practice, this number simply controls how many prefetch requests in total
may be active concurrently :
        max_async_io_prefetchers * target_prefetch_pages ( see note 1 )

default is max_connections/6
and recall that the default for max_connections is 100


note 1  a number based on effective_io_concurrency and approximately n * ln(n)
        where n is effective_io_concurrency

note 2  Provided that the gnu extension to Posix AIO which provides the
        aio_init() function is present, then aio_init() is called
        to set the librt maximum number of threads to max_async_io_prefetchers,
        and to set the maximum number of concurrent aio read requests to the product of
        max_async_io_prefetchers * target_prefetch_pages
As well as this regular configuration parameter,
there are several other parameters that can be set via environment variable.
The reason why they are environment vars rather than regular configuration parameters
is that it is not expected that they should need to be set, but they may be useful :

   variable name                      values                 default   meaning
   PG_TRY_PREFETCHING_FOR_BITMAP      [Y|N]                  Y         whether to prefetch bitmap heap scans
   PG_TRY_PREFETCHING_FOR_ISCAN       [Y|N|integer[,[N|Y]]]  256,N     whether to prefetch non-bitmap index scans;
                                                                      also numeric size of list of prefetched blocks;
                                                                      also whether to prefetch forward-sequential-pattern index pages
   PG_TRY_PREFETCHING_FOR_BTREE       [Y|N]                  Y         whether to prefetch heap pages in non-bitmap index scans
   PG_TRY_PREFETCHING_FOR_HEAP        [Y|N]                  N         whether to prefetch relation (un-indexed) heap scans


The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated.
It can be set to Y or N to control prefetching of non-bitmap index scans;
but in addition it can be set to an integer, which both implies Y
and also sets the size of a list used to remember prefetched but unread heap pages.
This list is an optimization used to avoid re-prefetching and maximise the potential
set of prefetchable blocks indexed by one index page.
If set to an integer, this integer may be followed by either ,Y or ,N
to specify whether to prefetch index pages which are being accessed forward-sequentially.
It has been found that prefetching is not of great benefit for this access pattern,
and so it is not the default, but it also does no harm (provided sufficient CPU capacity).
Usage :
______

There are no changes in usage other than as noted under Configuration and Statistics.
However, in order to assess benefit from this feature, it will be useful to
understand the query access plans of your workload using EXPLAIN.  Before doing that,
make sure that statistics are up to date using ANALYZE.



Internals:
__________

Internal changes span two areas and the interface between them :

 .  buffer manager layer
 .  programming interface for scanner to call buffer manager
 .  scanner layer

 .  buffer manager layer
    ____________________

    changes comprise :
       .   allocating, pinning, unpinning buffers
            this is complex and discussed briefly below in "Buffer Management"
       .   acquiring and releasing a BufferAiocb, the control block
            associated with a single aio_read, and checking for its completion
            a new file, backend/storage/buffer/buf_async.c, provides three new functions,
                  BufStartAsync        BufReleaseAsync        BufCheckAsync
            which handle this.
       .   calling librt asynch io functions
            this follows the example of all other filesystem interfaces
            and is straightforward.
            two new functions are provided in fd.c:
                  FileStartaio        FileCompleteaio
            and corresponding interfaces in smgr.c

 .  programming interface for scanner to call buffer manager
    ________________________________________________________
       .  calling interface for existing function PrefetchBuffer is modified :
           .  one new argument, BufferAccessStrategy strategy
           .  now returns an int return code which indicates :
                     whether pin count on buffer has been increased by 1
                     whether block was already present in a buffer
       .  new function DiscardBuffer
           .  discard buffer used for a previously prefetched page
              which scanner decides it does not want to read.
           .  same arguments as for PrefetchBuffer except for omission of BufferAccessStrategy
           .  note - this is different from the existing function ReleaseBuffer
              in that ReleaseBuffer takes a buffer_descriptor as argument
              for a buffer which has been read, but has similar purpose.

 .  scanner layer
    _____________
        common to all scanners is that a scanner which wishes to prefetch must do three things:
          .  decide which pages to prefetch and call PrefetchBuffer to prefetch them
                nodeBitmapHeapscan already does this (but note one extra argument on PrefetchBuffer)
          .  remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range),
                removing each page from this list if and when it subsequently reads the page.
          .  at end of scan, call DiscardBuffer for every remembered (i.e. prefetched but not read) page
        how this list of prefetched pages is implemented varies for each of the three scanners and four scan types:
            .  bitmap index scan - heap pages
            .  non-bitmap (i.e. simple) index scans - index pages
            .  non-bitmap (i.e. simple) index scans - heap pages
            .  simple heap scans
        The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are:
            .   counted in aio_read_forgot (see "Statistics" above)
            .   may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "
                 (the buffer is released at commit)
        This does sometimes happen ...
Buffer Management
_________________

With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward,
but also every other part of the buffer manager must know about the possibility that a buffer may be
in an async_io_in_progress state and be prepared to determine the possible completion.
That is, one backend BK1 may start the io but another BK2 may try to read it before BK1 does.
Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,
namely aio_suspend(), which this extension uses.  Therefore, although StartBufferIO and TerminateBufferIO
are called as part of asynchronous prefetching, their role is limited to maintaining the buffer
descriptor flags, and they do not track the asynchronous IO itself.  Instead, asynchronous IOs are
tracked in a separate set of shared control blocks, the BufferAiocb list -
refer to include/storage/buf_internals.h
Checking asynchronous io status is handled in backend/storage/buffer/buf_async.c, in the BufCheckAsync function.
Read the commentary for this function for more details.
Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching,
and the logic is spread throughout BufStartAsync, BufCheckAsync, and many functions in bufmgr.c.
When a backend BK2 requests ReadBuffer of a page for which an asynch read is in progress,
the buffer manager has to determine which backend BK1 pinned this buffer during the previous
PrefetchBuffer, and, for example, must not pin the buffer a second time if BK2 is BK1.
Information concerning which backend initiated the prefetch is held in the BufferAiocb.

The trickiest case concerns the scenario in which :
   .  BK1 initiates prefetch and acquires a pin
   .  BK2 possibly waits for completion and then reads the buffer, and perhaps later on
        releases it by ReleaseBuffer.
   .  Since the asynchronous IO is no longer in progress, there is no longer any
        BufferAiocb associated with it.  Yet the buffer manager must remember that BK1 holds a
        "prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
   .  The solution to this problem is to invent the concept of a "banked" pin,
      which is a pin obtained when prefetch was issued, identified as in "banked" status only if and when
      the associated asynchronous IO terminates, and redeemable by the next use by the same task,
      either by ReadBuffer or DiscardBuffer.
      The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)
      is stored in the buffer descriptor.
      This is done without increasing the size of the buffer descriptor, which is important since
      there may be a very large number of these.  This does overload the relevant field in the descriptor.
      Refer to include/storage/buf_internals.h for more details,
      and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c
______________________________________________________________________________
The following 43 files are changed in this feature (output of the patch command) :

patching file configure.in
patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
patching file contrib/pg_stat_statements/Makefile
patching file contrib/pg_stat_statements/pg_stat_statements.c
patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
patching file config/c-library.m4
patching file src/backend/postmaster/postmaster.c
patching file src/backend/executor/nodeBitmapHeapscan.c
patching file src/backend/executor/nodeIndexscan.c
patching file src/backend/executor/instrument.c
patching file src/backend/storage/buffer/Makefile
patching file src/backend/storage/buffer/bufmgr.c
patching file src/backend/storage/buffer/buf_async.c
patching file src/backend/storage/buffer/buf_init.c
patching file src/backend/storage/smgr/md.c
patching file src/backend/storage/smgr/smgr.c
patching file src/backend/storage/file/fd.c
patching file src/backend/storage/lmgr/proc.c
patching file src/backend/access/heap/heapam.c
patching file src/backend/access/heap/syncscan.c
patching file src/backend/access/index/indexam.c
patching file src/backend/access/index/genam.c
patching file src/backend/access/nbtree/nbtsearch.c
patching file src/backend/access/nbtree/nbtinsert.c
patching file src/backend/access/nbtree/nbtpage.c
patching file src/backend/access/nbtree/nbtree.c
patching file src/backend/nodes/tidbitmap.c
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/mmgr/aset.c
patching file src/include/executor/instrument.h
patching file src/include/storage/bufmgr.h
patching file src/include/storage/smgr.h
patching file src/include/storage/fd.h
patching file src/include/storage/buf_internals.h
patching file src/include/catalog/pg_am.h
patching file src/include/catalog/pg_proc.h
patching file src/include/pg_config_manual.h
patching file src/include/access/nbtree.h
patching file src/include/access/heapam.h
patching file src/include/access/relscan.h
patching file src/include/nodes/tidbitmap.h
patching file src/include/utils/rel.h
patching file src/include/pg_config.h.in


Future Possibilities:
____________________
There are several possible extensions of this feature :
   .   Extend prefetching of index scans to types of index other than B-tree.
       This should be fairly straightforward, but requires some
       good base of benchmarkable workloads to prove the value.
   .   Investigate why asynchronous IO prefetching does not greatly
       improve sequential relation heap scans, and possibly find how to
       achieve a benefit.
   .   Build knowledge of asynchronous IO prefetching into the
       Query Planner costing.
       This is far from straightforward.  The Postgresql Query Planner's
       costing model is based on resource consumption rather than elapsed time.
       Use of asynchronous IO prefetching is intended to improve elapsed time
       at the expense of (probably) higher resource consumption.
       Although Costing understands about the reduced cost of reading buffered
       blocks, it does not take asynchronicity or overlap of CPU with disk
       into account.  A naive approach might be to try to tweak the Query
       Planner's Cost Constant configuration parameters
       such as seq_page_cost , random_page_cost
       but this is hazardous as explained in the Documentation.



John Lumby,  johnlumby(at)hotmail(dot)com


Re: Race condition within _bt_findinsertloc()? (new page split code)

From
Peter Geoghegan
Date:
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> Fair enough, but I don't think that affects correctness either way (I
>> don't think that you meant to imply that this was a necessary
>> precaution that you'd taken - right?).
>
>
> Right.

So, the comments within _bt_search() suggest that the _bt_moveright()
call will perform a _bt_finish_split() call opportunistically iff it's
called from _bt_doinsert() (i.e. access == BT_WRITE). There is no
reason to not do so in all circumstances though, assuming that it's
desirable to do so as soon as possible (which I *don't* actually
assume). If I'm not mistaken, it's also true that it would be strictly
speaking correct to never do it there. Do you think it would be a fair
stress-test if I were to hack Postgres so that this call never happens
(within _bt_moveright())? I'd also have an incomplete page split occur
at random with a probability of, say, 1% per split. The mechanism
would be much the same as your original test-case for the split patch
- I'd throw an error at the wrong place, although only 1% of the time,
and over many hours.

The concern with the _bt_moveright() call of _bt_finish_split() is
that it might conceal a problem without reliably fixing it,
potentially making isolating that problem much harder.

-- 
Peter Geoghegan



Re: Race condition within _bt_findinsertloc()? (new page split code)

From
Heikki Linnakangas
Date:
On 05/28/2014 02:15 AM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> Fair enough, but I don't think that affects correctness either way (I
>>> don't think that you meant to imply that this was a necessary
>>> precaution that you'd taken - right?).
>>
>> Right.
>
> So, the comments within _bt_search() suggest that the _bt_moveright()
> call will perform a _bt_finish_split() call opportunistically iff it's
> called from _bt_doinsert() (i.e. access == BT_WRITE). There is no
> reason to not do so in all circumstances though, assuming that it's
> desirable to do so as soon as possible (which I *don't* actually
> assume).

You can't write in a hot standby, so that's one reason to only do it 
when inserting, not when querying. Even when you're not in a hot 
standby, it seems like a nice property that a read-only query doesn't do 
writes. I know we do hint bit updates and other kinds of write-action 
when reading anyway, but still.

> If I'm not mistaken, it's also true that it would be strictly
> speaking correct to never do it there.

Hmm. No, it wouldn't. It is not allowed to have two incompletely-split 
pages adjacent to each other. If you move right to the right-half of an 
incomplete split, i.e. a page that does not have a downlink in its 
parent, and then try to split the page again, _bt_insert_parent() would 
fail to find the location to insert the new downlink to. It assumes that 
there is a downlink to the page being split in the parent, and uses that 
to find the location for the new downlink.

- Heikki



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Peter Geoghegan
Date:
On Tue, May 27, 2014 at 3:17 PM, John Lumby <johnlumby@hotmail.com> wrote:
> Below I am pasting the README we have written for this new functionality
> which mentions some of the measurements, advantages (and disadvantages)
> and we welcome all and any comments on this.

I think that this is likely to be a useful area to work on, but I
wonder: can you suggest a useful test-case or benchmark to show the
advantages of the patch you posted? You mention a testcase already,
but it's a little short on details. I think it's always a good idea to
start with that when pursuing a performance feature.

Have you thought about things like specialized prefetching for nested
loop joins?

-- 
Peter Geoghegan



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Wed, May 28, 2014 at 6:51 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Have you thought about things like specialized prefetching for nested
> loop joins?

Currently, such a thing would need some non-trivial changes to the
execution nodes, I believe.

For nestloop, correct me if I'm wrong, but index scan nodes don't have
visibility of the next tuple to be searched for.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Peter Geoghegan
Date:
On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> For nestloop, correct me if I'm wrong, but index scan nodes don't have
> visibility of the next tuple to be searched for.

Nested loop joins are considered a particularly compelling case for
prefetching, actually.

-- 
Peter Geoghegan



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:
I have pasted  below the EXPLAIN of one of my benchmark queries
(the one I reference in the README).
Plenty of nested loop joins.
However I think I understand your question as to how effective it would be
if the outer is not sorted, and I will see if I can dig into that
if I get time  (and it sounds as though Claudio is on it too).

The area of exactly what the best prefetch strategy should be for
each particular type of scan and context is a good one to work on.

John

> Date: Wed, 28 May 2014 18:12:23 -0700
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: pg@heroku.com
> To: klaussfreire@gmail.com
> CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>
> On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > For nestloop, correct me if I'm wrong, but index scan nodes don't have
> > visibility of the next tuple to be searched for.
>
> Nested loop joins are considered a particularly compelling case for
> prefetching, actually.
>
> --
> Peter Geoghegan

____________________________________________________________________________________-
         QUERY PLAN                                                                                                                                                                                  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=801294.81..801294.81 rows=2 width=532)
   CTE deploy_zone_down
     ->  Recursive Union  (cost=1061.25..2687.40 rows=11 width=573)
           ->  Nested Loop  (cost=1061.25..1423.74 rows=1 width=41)
                 ->  Nested Loop  (cost=1061.11..1421.22 rows=14 width=49)
                       ->  Bitmap Heap Scan on entity zone_tree  (cost=1060.67..1175.80 rows=29 width=40)
                             Recheck Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text) AND ((discriminator)::text = 'ZONE'::text))
                             ->  BitmapAnd  (cost=1060.67..1060.67 rows=29 width=0)
                                   ->  Bitmap Index Scan on entity_name  (cost=0.00..139.71 rows=5927 width=0)
                                         Index Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text))
                                   ->  Bitmap Index Scan on entity_discriminatorx  (cost=0.00..920.70 rows=49636 width=0)
                                         Index Cond: ((discriminator)::text = 'ZONE'::text)
                       ->  Index Scan using metadata_value_owner_id on metadata_value mv  (cost=0.43..8.45 rows=1 width=17)
                             Index Cond: (owner_id = zone_tree.id)
                 ->  Index Scan using metadata_field_pkey on metadata_field mf  (cost=0.15..0.17 rows=1 width=8)
                       Index Cond: (id = mv.field_id)
                       Filter: ((name)::text = 'deployable'::text)
           ->  Nested Loop  (cost=0.87..126.34 rows=1 width=573)
                 ->  Nested Loop  (cost=0.72..125.44 rows=5 width=581)
                       ->  Nested Loop  (cost=0.29..83.42 rows=10 width=572)
                             ->  WorkTable Scan on deploy_zone_down dzd  (cost=0.00..0.20 rows=10 width=540)
                             ->  Index Scan using entity_discriminator_parent_zone on entity ch  (cost=0.29..8.31 rows=1 width=40)
                                   Index Cond: ((parent_id = dzd.zone_tree_id) AND ((discriminator)::text = 'ZONE'::text))
                       ->  Index Scan using metadata_value_owner_id on metadata_value mv_1  (cost=0.43..4.19 rows=1 width=17)
                             Index Cond: (owner_id = ch.id)
                 ->  Index Scan using metadata_field_pkey on metadata_field mf_1  (cost=0.15..0.17 rows=1 width=8)
                       Index Cond: (id = mv_1.field_id)
                       Filter: ((name)::text = 'deployable'::text)
   CTE deploy_zone_tree
     ->  Recursive Union  (cost=0.00..9336.89 rows=21 width=1105)
           ->  CTE Scan on deploy_zone_down dzd_1  (cost=0.00..0.22 rows=11 width=1105)
           ->  Nested Loop  (cost=0.43..933.63 rows=1 width=594)
                 ->  WorkTable Scan on deploy_zone_tree dzt  (cost=0.00..2.20 rows=110 width=581)
                 ->  Index Scan using entity_pkey on entity pt  (cost=0.43..8.46 rows=1 width=21)
                       Index Cond: (id = dzt.zone_tree_ancestor_parent_id)
                       Filter: ((discriminator)::text = ANY ('{VIEW,ZONE}'::text[]))
   CTE forward_host_ip
     ->  Nested Loop  (cost=1.30..149.65 rows=24 width=88)
           ->  Nested Loop  (cost=0.87..121.69 rows=48 width=56)
                 ->  Nested Loop  (cost=0.43..71.61 rows=99 width=48)
                       ->  CTE Scan on deploy_zone_tree dzt_1  (cost=0.00..0.47 rows=1 width=16)
                             Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
                       ->  Index Scan using entity_parent_id on entity host  (cost=0.43..70.14 rows=99 width=40)
                             Index Cond: (parent_id = dzt_1.zone_tree_id)
                             Filter: ((discriminator)::text = 'HOST'::text)
                 ->  Index Scan using entity_link_owner_id on entity_link link  (cost=0.43..0.50 rows=1 width=16)
                       Index Cond: (owner_id = host.id)
                       Filter: ((link_type)::text = ANY ('{IP,IP6}'::text[]))
           ->  Index Scan using entity_pkey on entity address  (cost=0.43..0.57 rows=1 width=40)
                 Index Cond: (id = link.entity_id)
                 Filter: ((discriminator)::text = ANY ('{IP4A,IP6A}'::text[]))
   CTE association_view
     ->  Nested Loop  (cost=0.87..26.29 rows=1 width=75)
           ->  Nested Loop  (cost=0.43..17.82 rows=1 width=56)
                 ->  CTE Scan on deploy_zone_tree dzt_2  (cost=0.00..0.47 rows=1 width=16)
                       Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
                 ->  Index Scan using entity_discriminator_parent_rr on entity record  (cost=0.43..17.34 rows=1 width=48)
                       Index Cond: ((parent_id = dzt_2.zone_tree_id) AND ((discriminator)::text = ANY ('{C,MX,SRV}'::text[])))
           ->  Index Scan using entity_pkey on entity assoc  (cost=0.43..8.46 rows=1 width=27)
                 Index Cond: (id = record.association_id)
   CTE simple_view
     ->  Nested Loop  (cost=0.43..22.27 rows=1 width=48)
           ->  CTE Scan on deploy_zone_tree dzt_3  (cost=0.00..0.47 rows=1 width=16)
                 Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
           ->  Index Scan using entity_discriminator_parent_rr on entity record_1  (cost=0.43..21.79 rows=1 width=40)
                 Index Cond: ((parent_id = dzt_3.zone_tree_id) AND ((discriminator)::text = ANY ('{TXT,HINFO,GENRR,NAPTR}'::text[])))
   CTE max_hist_id
     ->  Result  (cost=0.48..0.49 rows=1 width=0)
           InitPlan 6 (returns $19)
             ->  Limit  (cost=0.43..0.48 rows=1 width=8)
                   ->  Index Only Scan Backward using entity_history_history_id on entity_history xh  (cost=0.43..444052.51 rows=10386347 width=8)
                         Index Cond: (history_id IS NOT NULL)
   CTE relevant_history
     ->  Nested Loop  (cost=0.43..199661.39 rows=3438689 width=28)
           ->  CTE Scan on max_hist_id xh_1  (cost=0.00..0.02 rows=1 width=8)
           ->  Index Scan using entity_history_history_id on entity_history eh  (cost=0.43..156677.76 rows=3438689 width=20)
                 Index Cond: (history_id > xh_1.history_id)
                 Filter: (transaction_type = 'I'::bpchar)
   CTE resource_records
     ->  Unique  (cost=580178.30..584992.46 rows=160472 width=1063)
           ->  Sort  (cost=580178.30..580579.48 rows=160472 width=1063)
                 Sort Key: fip.host_id, fip.host_discriminator, fip.host_parent_id, fip.view_id, ((fip.address_id)::text), fip.host_name, ((fip.address_long1)::text), (((fip.host_long1 & 1::bigint))::text), ((fip.address_parent_id)::text), ((((''::text || (COALESCE(mv_2.longnumber, (-1)::bigint))::text) || ','::text) || (fip.address_discriminator)::text)), rh.long1
                 ->  Append  (cost=203.82..417112.92 rows=160472 width=1063)
                       ->  Hash Join  (cost=203.82..91844.90 rows=137548 width=1136)
                             Hash Cond: ((rh.hist_discrim)::text = (fip.address_discriminator)::text)
                             Join Filter: (rh.history_delta > fip.host_id)
                             ->  CTE Scan on relevant_history rh  (cost=0.00..68773.78 rows=3438689 width=532)
                             ->  Hash  (cost=203.52..203.52 rows=24 width=1128)
                                   ->  Nested Loop Left Join  (cost=0.43..203.52 rows=24 width=1128)
                                         ->  CTE Scan on forward_host_ip fip  (cost=0.00..0.48 rows=24 width=1120)
                                         ->  Index Scan using metadata_value_owner_id on metadata_value mv_2  (cost=0.43..8.45 rows=1 width=24)
                                               Index Cond: (owner_id = fip.host_id)
                       ->  Nested Loop  (cost=0.43..77722.85 rows=5731 width=644)
                             Join Filter: (rh_1.history_delta > av.record_id)
                             ->  Nested Loop Left Join  (cost=0.43..8.48 rows=1 width=636)
                                   ->  CTE Scan on association_view av  (cost=0.00..0.02 rows=1 width=628)
                                         Filter: ((record_discriminator)::text = 'C'::text)
                                   ->  Index Scan using metadata_value_owner_id on metadata_value mv_3  (cost=0.43..8.45 rows=1 width=24)
                                         Index Cond: (owner_id = av.record_id)
                             ->  CTE Scan on relevant_history rh_1  (cost=0.00..77370.50 rows=17193 width=532)
                                   Filter: ((hist_discrim)::text = 'C'::text)
                       ->  Hash Join  (cost=0.04..83402.53 rows=5731 width=636)
                             Hash Cond: ((rh_2.hist_discrim)::text = (av_1.record_discriminator)::text)
                             Join Filter: (rh_2.history_delta > av_1.record_id)
                             ->  CTE Scan on relevant_history rh_2  (cost=0.00..68773.78 rows=3438689 width=532)
                             ->  Hash  (cost=0.02..0.02 rows=1 width=628)
                                   ->  CTE Scan on association_view av_1  (cost=0.00..0.02 rows=1 width=628)
                                         Filter: ((record_discriminator)::text = ANY ('{MX,SRV}'::text[]))
                       ->  Nested Loop  (cost=0.86..79164.06 rows=5731 width=618)
                             Join Filter: (rh_3.history_delta > sv.record_id)
                             ->  Nested Loop Left Join  (cost=0.86..16.94 rows=1 width=610)
                                   ->  Nested Loop Left Join  (cost=0.43..8.48 rows=1 width=588)
                                         ->  CTE Scan on simple_view sv  (cost=0.00..0.02 rows=1 width=580)
                                               Filter: ((record_discriminator)::text = 'TXT'::text)
                                         ->  Index Scan using metadata_value_owner_id on metadata_value mv_4  (cost=0.43..8.45 rows=1 width=24)
                                               Index Cond: (owner_id = sv.record_id)
                                   ->  Index Scan using metadata_value_owner_id on metadata_value txtvalue  (cost=0.43..8.45 rows=1 width=38)
                                         Index Cond: (owner_id = sv.record_id)
                             ->  CTE Scan on relevant_history rh_3  (cost=0.00..77370.50 rows=17193 width=532)
                                   Filter: ((hist_discrim)::text = 'TXT'::text)
                       ->  Hash Join  (cost=0.04..83373.87 rows=5731 width=588)
                             Hash Cond: ((rh_4.hist_discrim)::text = (sv_1.record_discriminator)::text)
                             Join Filter: (rh_4.history_delta > sv_1.record_id)
                             ->  CTE Scan on relevant_history rh_4  (cost=0.00..68773.78 rows=3438689 width=532)
                             ->  Hash  (cost=0.02..0.02 rows=1 width=580)
                                   ->  CTE Scan on simple_view sv_1  (cost=0.00..0.02 rows=1 width=580)
                                         Filter: ((record_discriminator)::text = ANY ('{HINFO,GENRR,NAPTR}'::text[]))
   ->  Sort  (cost=4417.98..4418.48 rows=200 width=532)
         Sort Key: resource_records.discrim
         ->  HashAggregate  (cost=4412.98..4415.98 rows=200 width=532)
               Group Key: resource_records.discrim
               ->  CTE Scan on resource_records  (cost=0.00..3209.44 rows=160472 width=532)
 Planning time: 6.620 ms
(133 rows)

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On 28/05/2014 22:12, "Peter Geoghegan" <pg@heroku.com> wrote:
>
> On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > For nestloop, correct me if I'm wrong, but index scan nodes don't have
> > visibility of the next tuple to be searched for.
>
> Nested loop joins are considered a particularly compelling case for
> prefetching, actually.

Of course. I only doubt it can be implemented without not-so-small changes
to all execution nodes.

I'll look into it.

> --
> Peter Geoghegan