Thread: Race condition within _bt_findinsertloc()? (new page split code)
While speccing out a new B-Tree verification tool, I had the opportunity to revisit a thought I had during review concerning _bt_findinsertloc(): that the new coding is unsafe in the event of deferred split completion of a leaf page of a unique index. To recap, we now do the following in _bt_findinsertloc():

    /*
     * If this page was incompletely split, finish the split now. We
     * do this while holding a lock on the left sibling, which is not
     * good because finishing the split could be a fairly lengthy
     * operation. But this should happen very seldom.
     */
    if (P_INCOMPLETE_SPLIT(lpageop))
    {
        _bt_finish_split(rel, rbuf, stack);
        rbuf = InvalidBuffer;
        continue;
    }

The "left sibling" referred to here is "the first page this key could be on", an important concept for unique index enforcement. It's that first page iff we're on our first iteration of the nested for(;;) loop in _bt_findinsertloc(). So the buffer lock held on this left sibling may constitute what in the past I've called a "value lock"; we've established the right to insert our value into the unique index at this point, and the lock will only be released when we're done (regardless of whether or not that buffer/page value lock is on the buffer/page we'll ultimately insert into, or an earlier one).

Anyway, the concern is that there may be problems when we call _bt_finish_split() with that left sibling locked throughout (i.e. finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and itself has a right sibling from the incomplete split (which is usually the value lock page's right-right sibling)). I'm not concerned about performance, since as the comment says it ought to be an infrequent occurrence. I also believe that there are no deadlock hazards. But consider this scenario:

* We insert the value 7 into an int4 unique index. We need to split the leaf page. We run out of memory or something, and ours is an incomplete page split. Our transaction is aborted. For the sake of argument, suppose that there are also already a bunch of dead tuples within the index with values of 7, 8 and 9.

* Another inserter of the value 7 comes along. It follows exactly the same downlink as the first, now aborted inserter (suppose the downlink's value is 9). It also locks the same leaf page to establish a "value lock" in precisely the same manner. Finding no room on the first page, it looks further right, maintaining its original "value lock" throughout. It finishes the first inserter's split of the non-value-lock page - a new downlink is inserted into the parent page, with the value 8. It then releases all buffer locks except the first one/original "value lock". A physical insertion has yet to occur.

* A third inserter of the value 7 comes along. It gets to a later page than the one locked by the second inserter, preferring the newer downlink with value 8 (the internal-page _bt_binsrch() logic ensures this). It exclusive locks that later page/buffer before the second guy gets a chance to lock it once again. It establishes the right to insert with _bt_check_unique(), undeterred by the second inserter's buffer lock/"value lock". The value lock is effectively skipped over.

* Both the second and third inserters have "established the right" to insert the same value, 7, and both do so. The unique index has an MVCC-snapshot-wise spurious duplicate, and so is corrupt.
Regardless of whether or not I have these details exactly right (that is, regardless of whether or not this scenario is strictly possible), I suggest as a code-hardening measure that _bt_findinsertloc() release its "value lock", upon realizing it must complete splits, and then complete the split or splits known to be required. It would finally report that it "couldn't find" an insertion location to _bt_doinsert(), which would then retry from the start, just like when _bt_check_unique() finds an inconclusive conflict. The only difference is that we don't have an xact to wait on. We haven't actually done anything extra that makes this later "goto top;" any different to the existing one.

This should occur so infrequently that it isn't worth trying harder, or worth differentiating between the UNIQUE_CHECK_NO and !UNIQUE_CHECK_NO cases when retrying. This also removes the more general risk of holding an extra buffer lock during page split completion.

It kind of looks like _bt_findinsertloc() doesn't have this bug, because in my scenario _bt_finish_split() is called with both the value lock and its right page locked (so the right page is the left page for _bt_finish_split()'s purposes). But when you take another look, and realize that _bt_finish_split() releases its locks, and so once again only the original value lock will be held when _bt_finish_split() returns, and so the downlink is there to skip the value locked page, you realize that the bug does exist (assuming that I haven't failed to consider some third factor and am not otherwise mistaken).
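Concretely, the retry might look about like this in _bt_doinsert() (a simplified sketch only - argument lists are elided, and the boolean return convention for _bt_findinsertloc() is hypothetical; the existing "top:" retry is what we already do for an inconclusive _bt_check_unique()):

    top:
        /* descend to the leaf page the value could first be on, and lock it */
        buf = _bt_search(...);

        if (checkUnique != UNIQUE_CHECK_NO)
        {
            xwait = _bt_check_unique(...);
            if (TransactionIdIsValid(xwait))
            {
                /* existing retry: drop the lock, wait out the
                 * conflicting xact, and start over */
                _bt_relbuf(rel, buf);
                XactLockTableWait(xwait);
                goto top;
            }
        }

        if (!_bt_findinsertloc(...))
        {
            /* proposed retry: _bt_findinsertloc() saw an incomplete
             * split, released its "value lock", completed the
             * split(s), and reported failure; there is no xact to
             * wait on - just start over */
            goto top;
        }

        /* perform the physical insertion */
        _bt_insertonpg(...);

Thoughts?
-- 
Peter Geoghegan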
On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
> While speccing out a new B-Tree verification tool, I had the
> opportunity to revisit a thought I had during review concerning
> _bt_findinsertloc(): that the new coding is unsafe in the event of
> deferred split completion of a leaf page of a unique index. To recap,
> we now do the following in _bt_findinsertloc():
>
>     /*
>      * If this page was incompletely split, finish the split now. We
>      * do this while holding a lock on the left sibling, which is not
>      * good because finishing the split could be a fairly lengthy
>      * operation. But this should happen very seldom.
>      */
>     if (P_INCOMPLETE_SPLIT(lpageop))
>     {
>         _bt_finish_split(rel, rbuf, stack);
>         rbuf = InvalidBuffer;
>         continue;
>     }
>
> The "left sibling" referred to here is "the first page this key could
> be on", an important concept for unique index enforcement.

No, it's not "the first page this key could be on". _bt_findinsertloc() does *not* hold a lock on the first valid page the key could go to. It merely ensures that when it steps to the next page, it releases the lock on the previous page only after acquiring the lock on the next page. Throughout the operation, it will hold a lock on *some* page that could legally hold the inserted value, and it acquires the locks in left-to-right order. This is sufficient for the uniqueness checks, because _bt_check_unique() scans all the pages, and _bt_check_unique() *does* hold the first page locked while it scans the rest of the pages. But _bt_findinsertloc() does not.

Also note that _bt_moveright() also finishes any incomplete splits it encounters (when called for an insertion). So _bt_findinsertloc() never gets called on a page with the incomplete-split flag set. It might encounter one when it moves right, but never the first page.

> Anyway, the concern is that there may be problems when we call
> _bt_finish_split() with that left sibling locked throughout (i.e.
> finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
> itself has a right sibling from the incomplete split (which is usually
> the value lock page's right-right sibling)). I'm not concerned about
> performance, since as the comment says it ought to be an infrequent
> occurrence. I also believe that there are no deadlock hazards. But
> consider this scenario:
>
> * We insert the value 7 into an int4 unique index. We need to split
> the leaf page. We run out of memory or something, and ours is an
> incomplete page split. Our transaction is aborted. For the sake of
> argument, suppose that there are also already a bunch of dead tuples
> within the index with values of 7, 8 and 9.

If I understood correctly, the tree looks like this before the insertion:

Parent page:
+-------------+
|             |
| 9 -> A      |
+-------------+

Leaf A
+-------------+
| HI-key: 9   |
|             |
| 7 8 9       |
+-------------+

And after insertion and incomplete split:

Parent page
+-------------+
|             |
| 9 -> A      |
+-------------+

Leaf A                 Leaf B
+--------------+       +-------------+
| HI-key: 8    |       | HI-key: 9   |
| (INCOMPLETE_ |       |             |
|  SPLIT)      |  <->  |             |
|              |       |             |
| 7 7* 8       |       | 9           |
+--------------+       +-------------+

where 7* is the newly inserted key, with value 7. (you didn't mention at what point the split happens, but in the next paragraph you said the new downlink has value 8, which implies the above split)

> * Another inserter of the value 7 comes along. It follows exactly the
> same downlink as the first, now aborted inserter (suppose the
> downlink's value is 9).
> It also locks the same leaf page to establish
> a "value lock" in precisely the same manner. Finding no room on the
> first page, it looks further right, maintaining its original "value
> lock" throughout. It finishes the first inserter's split of the
> non-value-lock page - a new downlink is inserted into the parent page,
> with the value 8. It then releases all buffer locks except the first
> one/original "value lock". A physical insertion has yet to occur.

Hmm, I think you got confused at this step. When inserting a 7, you cannot "look further right" to find a page with more space, because the HI-key, 8, on the first page stipulates that 7 must go on that page, not some later page.

> * A third inserter of the value 7 comes along. It gets to a later page
> than the one locked by the second inserter, preferring the newer
> downlink with value 8 (the internal-page _bt_binsrch() logic ensures
> this).

That's a contradiction: the downlink with value 8 points to the first page, not some later page. After the split is finished, the tree looks like this:

Parent page
+-------------+
| 8 -> A      |
| 9 -> B      |
+-------------+

Leaf A                Leaf B
+-------------+       +-------------+
| HI-key: 8   |       | HI-key: 9   |
|             |  <->  |             |
| 7 7* 8      |       | 9           |
+-------------+       +-------------+

> Regardless of whether or not I have these details exactly right (that
> is, regardless of whether or not this scenario is strictly possible) I
> suggest as a code-hardening measure that _bt_findinsertloc() release
> its "value lock", upon realizing it must complete splits, and then
> complete the split or splits known to be required. It would finally
> report that it "couldn't find" an insertion location to
> _bt_doinsert(), which would then retry from the start, just like when
> _bt_check_unique() finds an inconclusive conflict. The only difference
> is that we don't have an xact to wait on. We haven't actually done
> anything extra that makes this later "goto top;" any different to the
> existing one.
>
> This should occur so infrequently that it isn't worth trying harder,
> or worth differentiating between the UNIQUE_CHECK_NO and
> !UNIQUE_CHECK_NO cases when retrying. This also removes the more
> general risk of holding an extra buffer lock during page split
> completion.

Yeah, that would work too. It seems safe enough as it is, though, so I don't see the point.

> It kind of looks like _bt_findinsertloc() doesn't have this bug,
> because in my scenario _bt_finish_split() is called with both the
> value lock and its right page locked (so the right page is the left
> page for _bt_finish_split()'s purposes). But when you take another
> look, and realize that _bt_finish_split() releases its locks, and so
> once again only the original value lock will be held when
> _bt_finish_split() returns, and so the downlink is there to skip the
> value locked page, you realize that the bug does exist (assuming that
> I haven't failed to consider some third factor and am not otherwise
> mistaken).

When inserting, the scan for the right insert location always begins from the first page where the key can legally go to. Inserting a missing downlink doesn't change what that page is - it just makes it faster to find, by reducing the number of right-links you need to follow.

PS. Thanks for looking into this again! These B-tree changes really need thorough review.

- Heikki
On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> The "left sibling" referred to here is "the first page this key could
>> be on", an important concept for unique index enforcement.
>
> No, it's not "the first page this key could be on".

Well, it may be initially. I could have been more cautious about the terminology here.

> Also note that _bt_moveright() also finishes any incomplete splits it
> encounters (when called for an insertion). So _bt_findinsertloc() never gets
> called on a page with the incomplete-split flag set. It might encounter one
> when it moves right, but never the first page.

Fair enough, but I don't think that affects correctness either way (I don't think that you meant to imply that this was a necessary precaution that you'd taken - right?). It's a nice property, since it makes the extra locking while completing a split within _bt_findinsertloc() particularly infrequent. But, that might also be a bad thing, when considered from a different perspective.

> If I understood correctly, the tree looks like this before the insertion:
>
> Parent page:
> +-------------+
> |             |
> | 9 -> A      |
> +-------------+
>
> Leaf A
> +-------------+
> | HI-key: 9   |
> |             |
> | 7 8 9       |
> +-------------+
>
> And after insertion and incomplete split:
>
> Parent page
> +-------------+
> |             |
> | 9 -> A      |
> +-------------+
>
> Leaf A                 Leaf B
> +--------------+       +-------------+
> | HI-key: 8    |       | HI-key: 9   |
> | (INCOMPLETE_ |       |             |
> |  SPLIT)      |  <->  |             |
> |              |       |             |
> | 7 7* 8       |       | 9           |
> +--------------+       +-------------+

> After the split is finished, the tree looks like this:
>
> Parent page
> +-------------+
> | 8 -> A      |
> | 9 -> B      |
> +-------------+
>
> Leaf A                Leaf B
> +-------------+       +-------------+
> | HI-key: 8   |       | HI-key: 9   |
> |             |  <->  |             |
> | 7 7* 8      |       | 9           |
> +-------------+       +-------------+

How did the parent page change between before and after the final atomic operation (page split completion)? What happened to "9 -> A"?

>> Regardless of whether or not I have these details exactly right (that
>> is, regardless of whether or not this scenario is strictly possible) I
>> suggest as a code-hardening measure that _bt_findinsertloc() release
>> its "value lock", upon realizing it must complete splits, and then
>> complete the split or splits known to be required. It would finally
>> report that it "couldn't find" an insertion location to
>> _bt_doinsert(), which would then retry from the start, just like when
>> _bt_check_unique() finds an inconclusive conflict.
>
> Yeah, that would work too. It seems safe enough as it is, though, so I don't
> see the point.

Well, it would be nice to not have to finish the page split in what is a particularly critical path, with that extra buffer lock. It's not strictly necessary, but then it is theoretically safer, and certainly much clearer. The fact that this code is so seldom executed is one issue that made me revisit this. On the other hand, I can see why you'd want to avoid cluttering up the relatively comprehensible _bt_doinsert() function if it could be avoided. I defer to you.

> PS. Thanks for looking into this again! These B-tree changes really need
> thorough review.

You're welcome. Hopefully my questions will lead you in a useful direction, even if my concerns turn out to be, in the main, unfounded. :-) It previously wasn't in evidence that you'd considered these interactions, and I feel better knowing that you have.
-- 
Peter Geoghegan
On 05/27/2014 09:47 PM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Also note that _bt_moveright() also finishes any incomplete splits it
>> encounters (when called for an insertion). So _bt_findinsertloc() never gets
>> called on a page with the incomplete-split flag set. It might encounter one
>> when it moves right, but never the first page.
>
> Fair enough, but I don't think that affects correctness either way (I
> don't think that you meant to imply that this was a necessary
> precaution that you'd taken - right?).

Right.

>> If I understood correctly, the tree looks like this before the insertion:
>>
>> Parent page:
>> +-------------+
>> |             |
>> | 9 -> A      |
>> +-------------+
>>
>> Leaf A
>> +-------------+
>> | HI-key: 9   |
>> |             |
>> | 7 8 9       |
>> +-------------+
>>
>> And after insertion and incomplete split:
>>
>> Parent page
>> +-------------+
>> |             |
>> | 9 -> A      |
>> +-------------+
>>
>> Leaf A                 Leaf B
>> +--------------+       +-------------+
>> | HI-key: 8    |       | HI-key: 9   |
>> | (INCOMPLETE_ |       |             |
>> |  SPLIT)      |  <->  |             |
>> |              |       |             |
>> | 7 7* 8       |       | 9           |
>> +--------------+       +-------------+
>
>> After the split is finished, the tree looks like this:
>>
>> Parent page
>> +-------------+
>> | 8 -> A      |
>> | 9 -> B      |
>> +-------------+
>>
>> Leaf A                Leaf B
>> +-------------+       +-------------+
>> | HI-key: 8   |       | HI-key: 9   |
>> |             |  <->  |             |
>> | 7 7* 8      |       | 9           |
>> +-------------+       +-------------+
>
> How did the parent page change between before and after the final
> atomic operation (page split completion)? What happened to "9 -> A"?

Ah, sorry, I got that wrong. The downlinks store the *low* key of the child page, not the high key as I depicted. Let me try again:

On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
> Anyway, the concern is that there may be problems when we call
> _bt_finish_split() with that left sibling locked throughout (i.e.
> finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
> itself has a right sibling from the incomplete split (which is usually
> the value lock page's right-right sibling)). I'm not concerned about
> performance, since as the comment says it ought to be an infrequent
> occurrence. I also believe that there are no deadlock hazards. But
> consider this scenario:
>
> * We insert the value 7 into an int4 unique index. We need to split
> the leaf page. We run out of memory or something, and ours is an
> incomplete page split. Our transaction is aborted. For the sake of
> argument, suppose that there are also already a bunch of dead tuples
> within the index with values of 7, 8 and 9.

So, initially the tree looks like this:

Parent page:
+-------------+
|             |
| 7 -> A      |
+-------------+

Leaf A
+-------------+
| HI-key: 9   |
|             |
| 7 8 9       |
+-------------+

And after insertion and incomplete split:

Parent page
+-------------+
|             |
| 7 -> A      |
+-------------+

Leaf A                 Leaf B
+--------------+       +-------------+
| HI-key: 8    |       | HI-key: 9   |
| (INCOMPLETE_ |       |             |
|  SPLIT)      |  <->  |             |
|              |       |             |
| 7 7* 8       |       | 9           |
+--------------+       +-------------+

where 7* is the newly inserted key, with value 7. (you didn't mention at what point the split happens, but in the next paragraph you said the new downlink has value 8, which implies the above split)

> * Another inserter of the value 7 comes along. It follows exactly the
> same downlink as the first, now aborted inserter (suppose the
> downlink's value is 9). It also locks the same leaf page to establish
> a "value lock" in precisely the same manner.
> Finding no room on the
> first page, it looks further right, maintaining its original "value
> lock" throughout. It finishes the first inserter's split of the
> non-value-lock page - a new downlink is inserted into the parent page,
> with the value 8. It then releases all buffer locks except the first
> one/original "value lock". A physical insertion has yet to occur.

The downlink of the original page cannot contain 9. Because, as I now remember ;-), the downlinks contain low-keys, not high keys.

- Heikki
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
> page, not the high key as I depicted. Let me try again:

Would you mind humoring me, and including a corrected final post-downlink-insert diagram, when the split is fully complete? You omitted that.
-- 
Peter Geoghegan
On 05/27/2014 11:30 PM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
>> page, not the high key as I depicted. Let me try again:
>
> Would you mind humoring me, and including a corrected final
> post-downlink-insert diagram, when the split is fully complete? You
> omitted that.

Sure:

Parent page
+-------------+
| 7 -> A      |
| 8 -> B      |
+-------------+

Leaf A                 Leaf B
+--------------+       +-------------+
| HI-key: 8    |       | HI-key: 9   |
|              |       |             |
|              |  <->  |             |
|              |       |             |
| 7 7* 8       |       | 9           |
+--------------+       +-------------+

- Heikki
<div dir="ltr">Claudio Freire and I are proposing new functionality for Postgresql <br />to extend the scope of prefetchingand also exploit posix asynchronous IO<br />when doing prefetching, and have a patch based on 9.4dev<br />readyfor consideration.<br /><br />This topic has cropped up at irregular intervals over the years,<br />e.g. this threadback in 2012<br /> <a href="www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com" target="_blank">www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com</a><br />andthis thread more recently<br /> http://www.postgresql.org/message-id/CAGTBQpaFC_z=zdWVAXD8wWss3v6jxZ5pNmrrYPsD23LbrqGvgQ@mail.gmail.com<br/><br />We nowhave an implementation which gives useful performance improvement<br />as well as other advantages compared to what iscurrently available,<br />at least for certain environments.<br /><br />Below I am pasting the README we have written forthis new functionality<br />which mentions some of the measurements, advantages (and disadvantages)<br />and we welcomeall and any comments on this.<br /><br />I will send the patch to commitfest later, once this email is posted to hackers,<br/>so that anyone who wishes can try it, or apply directly to me if you wish.<br />The patch is currently basedon 9.4dev but a version based on 9.3.4<br />will be available soon if anyone wants that. The patch is large (43files)<br />so non-trivial to review, but any comments on it (when posted) will be<br />appreciated and acted on. Note that at present the only environment<br />in which it has been applied and tested is linux.<br /><br />John Lumby <br />__________________________________________________________________________________________________<br /><br/><br />Postgresql -- Extended Prefetching using Asynchronous IO<br />============================================================<br/><br />Postgresql currently (9.3.4) provides a limitedprefetching capability<br />using posix_fadvise to give hints to the Operating System kernel<br />about which pagesit expects to read in the near future.<br />This capability is used only during the heap-scan phase of bitmap-indexscans.<br />It is controlled via the effective_io_concurrency configuration parameter.<br /><br />This capabilityis now extended in two ways :<br /> . use asynchronous IO into Postgresql shared buffers as an<br /> alternative to posix_fadvise<br /> . Implement prefetching in other types of scan :<br /> . non-bitmap(i.e. simple) index scans - index pages<br /> currently only for B-tree indexes.<br /> (developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)<br /> . non-bitmap (i.e.simple) index scans - heap pages<br /> currently only for B-tree indexes.<br /> . simple heap scans<br /><br />Posix asynchronous IO is chosen as the function library for asynchronous IO,<br/>since this is well supported and also fits very well with the model of<br />the prefetching process, particularlyas regards checking for completion<br />of an asynchronous read. On linux, Posix asynchronous IO is provided<br/>in the librt library. librt uses independently-schedulable threads to<br />achieve the asynchronicity, rather than kernel functionality.<br /><br />In this implementation, use of asynchronous IO is limitedto prefetching<br />while performing one of the three types of scan<br /> . B-tree bitmap index scan -heap pages (as already exists)<br /> . B-tree non-bitmap (i.e. simple) index scans - index and heap pages<br/> . 
simple heap scans<br />on permanent relations. It is not used on temporary tables nor for writes.<br/><br />The advantages of Posix asynchronous IO into shared buffers<br />compared to posix_fadvise are :<br /> . Beneficial for non-sequential access patterns as well as sequential<br /> . No restriction on the kinds of IOwhich can be used<br /> (other kinds of asynchronous IO impose restrictions such as<br /> buffer alignment, use of non-buffered IO).<br /> . Does not interfere with standard linux kernel read-ahead functionality.<br/> (It has been stated in <br /> www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com<br/> that :<br/> "the kernel stops doing read-ahead when a call to posix_fadvise comes.<br /> I noticed the performancehit, and checked the kernel's code.<br /> It effectively changes the prediction mode from sequentialto fadvise,<br /> negating the (assumed) kernel's prefetch logic")<br /> . When the read requestis issued after a prefetch has completed,<br /> no delay associated with a kernel call to copy the page from<br/> kernel page buffers into the Postgresql shared buffer,<br /> since it is already there.<br /> Also, in a memory-constrained environment, there is a greater<br /> probability that the prefetched pagewill "stick" in memory<br /> since the linux kernel victimizes the filesystem page cache in preference<br /> to swapping out user process pages.<br /> . Statistics on prefetch success can be gathered (see "Statistics"below)<br /> which helps the administrator to tune the prefetching settings.<br /><br />These benefitsare most likely to be obtained in a system whose usage profile<br />(e.g. from iostat) shows:<br /> . highIO wait from mostly-read activity<br /> . disk access pattern is not entirely sequential<br /> (so kernelreadahead can't predict it but postgresql can)<br /> . sufficient spare idle CPU to run the librt pthreads<br/> or, stated another way, the CPU subsystem is relatively powerful<br /> compared to thedisk subsystem.<br />In such ideal conditions, and with a workload with plenty of index scans,<br />around 10% - 20%improvement in throughput has been achieved.<br />In an admittedly extreme environment measured by this author, witha workload<br />consisting of 8 client applications each running similar complex queries<br />(same query structure butdifferent predicates and constants),<br />including 2 Bitmap Index Scans and 17 non-bitmap index scans,<br />on a dual-coreIntel laptop (4 hyperthreads) with the database on a single<br />USB3-attached 500GB disk drive, and no part ofthe database in filesystem buffers<br />initially, (filesystem freshly mounted), comparing unpatched build<br />usingposix_fadvise with effective_io_concurrency 4 against same build patched<br />with async IO and effective_io_concurrency4 and max_async_io_prefetchers 32,<br />elapse time repeatably improved from around 640-670 secondsto around 530-550 seconds,<br />a 17% - 18% improvement. <br /><br />The disadvantages of Posix asynchronous IO comparedto posix_fadvise are:<br /> . probably higher CPU utilization:<br /> Firstly, the extra work performedby the librt threads adds CPU<br /> overhead, and secondly, if the asynchronous prefetching is effective,<br/> then it will deliver better (greater) overlap of CPU with IO, which<br /> will reduce elapsedtimes and hence increase CPU utilization percentage<br /> still more (during that shorter elapsed time).<br/> . 
  .  more context switching, because of the additional threads.

Statistics:
___________

A number of additional statistics relating to effectiveness of asynchronous IO are provided as an extension of the existing pg_stat_statements loadable module. Refer to the appendix "Additional Supplied Modules" in the current PostgreSQL Documentation for details of this module.

The following additional statistics are provided for asynchronous IO prefetching :

  .  aio_read_noneed : number of prefetches for which there was no need for the prefetch, as the block was already in the buffer pool
  .  aio_read_discrd : number of prefetches for which the buffer was not subsequently read and was therefore discarded
  .  aio_read_forgot : number of prefetches for which the buffer was not subsequently read and was then forgotten about
  .  aio_read_noblok : number of prefetches for which no BufferAiocb control block was available
  .  aio_read_failed : number of aio reads for which the aio itself failed or the read failed with an errno
  .  aio_read_wasted : number of aio reads for which the in-progress aio was cancelled and the disk block not used
  .  aio_read_waited : number of aio reads for which the disk block was used but had to be waited for
  .  aio_read_ontime : number of aio reads for which the disk block was used and was ready on time when requested

Some of these are (hopefully) self-explanatory. Some additional notes :

  .  aio_read_discrd and aio_read_forgot :
     the prefetch was wasted work, since the buffer was not subsequently read. The discrd case indicates that the scanner realized this and discarded the buffer, whereas the forgot case indicates that the scanner did not realize it, which should not normally occur. A high number in either suggests lowering effective_io_concurrency.

  .  aio_read_noblok :
     Any significant number in relation to all the other numbers indicates that max_async_io_prefetchers should be increased.

  .  aio_read_waited :
     The page was prefetched, but the asynchronous read had not completed by the time the scanner requested to read it. This causes extra overhead in waiting and indicates prefetching is not providing much if any benefit. The disk subsystem may be underpowered/overloaded in relation to the available CPU power.

  .  aio_read_ontime :
     The page was prefetched and the asynchronous read had completed by the time the scanner requested to read it. Optimal behaviour. If this number is large in relation to all the other numbers except (possibly) aio_read_noneed, then prefetching is working well.

To create the extension with support for these additional statistics, use the following syntax :
    CREATE EXTENSION pg_stat_statements VERSION '1.3'
or, if you run the new code against an existing database which already has the extension (see installation and migration below), you can
    ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'

A suggested set of commands for displaying these statistics might be :

    /* OPTIONALLY */ DROP extension pg_stat_statements;
    CREATE extension pg_stat_statements VERSION '1.3';
    /* run your workload */
    select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time \
         , aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot \
    from pg_stat_statements where shared_blks_read > 0;
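Building on the notes above, one rough way to summarize prefetch effectiveness in a single number is the fraction of used prefetched blocks which were ready on time (a suggested query only - the counter columns are as provided by the patched extension; the arithmetic is merely illustrative) :

    select substring(query from 1 for 24) ,
           round(100.0 * aio_read_ontime /
                 nullif(aio_read_ontime + aio_read_waited, 0), 1) as pct_ontime
    from pg_stat_statements where shared_blks_read > 0;

A low percentage points the same way as a high aio_read_waited count : the disk subsystem may be underpowered relative to the available CPU.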
Installation and Build Configuration:
_____________________________________

1. First - a prerequisite :
# as well as requiring all the usual package build tools such as gcc , make etc,
# as described in the instructions for building postgresql,
# the following is required :
   gnu autoconf at version 2.69 :
# run the following command
autoconf -V
# it *must* return
autoconf (GNU Autoconf) 2.69

2. If you don't have it or it is a different version, then you must obtain version 2.69 (which is the current version) from your distribution provider or from the gnu software download site.

3. Also you must have the source tree for postgresql version 9.4 (development version).
# all the following commands assume your current working directory is the top of the source tree.

4. cd to top of source tree :
# check it appears to be a postgresql source tree
ls -ld configure.in src
# should show both the file and the directory
grep PostgreSQL COPYRIGHT
# should show PostgreSQL Database Management System

5. Apply the patch :
patch -b -p0 -i <patch_file_path>
# should report no errors, 43 files patched (see list at bottom of this README)
# and all hunks applied
# check the patch was applied to configure.in
ls -ld configure.in.orig configure.in
# should show both files

6. Rebuild the configure script with the patched configure.in :
mv configure configure.orig;
autoconf configure.in >configure; echo "rc= $? from autoconf"; chmod +x configure;
ls -lrt configure.orig configure;

7. run the new configure script :
# if you have run configure before,
# then you may first want to save existing config.status and config.log if they exist,
# and then specify same configure flags and options as you specified before.
# the patch does not alter or extend the set of configure options
# if unsure, run ./configure --help
# if still unsure, run ./configure
./configure <other configure options as desired>

8. now check that configure decided that this environment supports asynchronous IO :
grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
# it should show
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
# if not, apparently your environment does not support asynch IO -
# the config.log will show how it came to that conclusion,
# also check for :
#  . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
#  . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
# do not proceed without this define being set.

9. do you want to use the new code on an existing cluster
   that was created using the same code base but without the patch?
   If so then run this nasty-looking command :
   (cut-and-paste it into a terminal window or a shell-script file)
   Otherwise continue to step 10.
   see Migration note below for explanation.
###############################################################################################
 fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx < 200 ]]; do {
  bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
 }; done;
 if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
 elif [[ -a $fl ]]; then {
   mv $fl $bkfl && {
     sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
     str="diff -w $bkfl $fl"; echo "$str"; eval "$str";
   };
  };
 else echo "ooopppss $fl is missing";
 fi;
###############################################################################################
# it should report something like
diff -w Makefile.global.bak0 Makefile.global
222c222
< CFLAGS = XXXX
---
> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
# where XXXX is some set of flags

10. now run the rest of the build process as usual -
    follow instructions in file INSTALL if that file exists, else e.g. run
make && make install

If the build fails with the following error :
undefined reference to `aio_init'
then edit the following file
src/include/pg_config_manual.h
and add the following line at the bottom :

#define DONT_HAVE_AIO_INIT

and then run
make clean && make && make install
See notes to section Runtime Configuration below for more information on this.

Migration , Runtime Configuration, and Use:
___________________________________________

Database Migration:
___________________

The new prefetching code for non-bitmap index scans introduces a new btree-index function named btpeeknexttuple. The correct way to add such a function involves also adding it to the catalog as an internal function in pg_proc. However, this results in the newly built code considering an existing database to be incompatible, i.e. requiring backup on the old code and restore on the new. This is normal behaviour for migration to a new version of postgresql, and is also a valid way of migrating a database for use with this asynchronous IO feature, but in this case it may be inconvenient.

As an alternative, the new code may be compiled with the macro define
AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
which does what it says by not altering the catalog. The patched build can then be run against an existing database cluster initdb'd using the unpatched build.

There are no known ill-effects of so doing, but :
  .  in any case, it is strongly suggested to make a backup of any precious database before accessing it with a patched build
  .  be aware that if this asynchronous IO feature is eventually released as part of postgresql, migration will probably be required anyway.

This option to avoid catalog migration is intended as a convenience for a quick test, and also makes it easier to obtain performance comparisons on the same database.

Runtime Configuration:
______________________

One new configuration parameter is settable in postgresql.conf and in any other way as described in the postgresql documentation :

max_async_io_prefetchers
    Maximum number of background processes concurrently using asynchronous
    librt threads to prefetch pages into shared memory buffers

This number can be thought of as the maximum number of librt threads concurrently active, each working on a list of from 1 to target_prefetch_pages pages (see notes 1 and 2).

In practice, this number simply controls how many prefetch requests in total may be active concurrently :
    max_async_io_prefetchers * target_prefetch_pages     (see note 1)

The default is max_connections/6, and recall that the default for max_connections is 100.

note 1 : target_prefetch_pages is a number based on effective_io_concurrency, approximately n * ln(n) where n is effective_io_concurrency.

note 2 : Provided that the gnu extension to Posix AIO which provides the aio_init() function is present, aio_init() is called to set the librt maximum number of threads to max_async_io_prefetchers, and to set the maximum number of concurrent aio read requests to the product max_async_io_prefetchers * target_prefetch_pages.

As well as this regular configuration parameter, there are several other parameters that can be set via environment variable. The reason why they are environment vars rather than regular configuration parameters is that it is not expected that they should need to be set, but they may be useful :

  variable name                   values                 default   meaning
  PG_TRY_PREFETCHING_FOR_BITMAP   [Y|N]                  Y         whether to prefetch bitmap heap scans
  PG_TRY_PREFETCHING_FOR_ISCAN    [Y|N|integer[,[N|Y]]]  256,N     whether to prefetch non-bitmap index scans;
                                                                   also the numeric size of the list of prefetched
                                                                   blocks; also whether to prefetch
                                                                   forward-sequential-pattern index pages
  PG_TRY_PREFETCHING_FOR_BTREE    [Y|N]                  Y         whether to prefetch heap pages in non-bitmap index scans
  PG_TRY_PREFETCHING_FOR_HEAP     [Y|N]                  N         whether to prefetch relation (un-indexed) heap scans

The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated. It can be set to Y or N to control prefetching of non-bitmap index scans; but in addition it can be set to an integer, which both implies Y and also sets the size of a list used to remember prefetched but unread heap pages. This list is an optimization used to avoid re-prefetching and to maximise the potential set of prefetchable blocks indexed by one index page. And if set to an integer, this integer may be followed by either ,Y or ,N to specify whether to prefetch index pages which are being accessed forward-sequentially. It has been found that prefetching is not of great benefit for this access pattern, and so it is not the default, but it also does no harm (provided sufficient CPU capacity).
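For example, the settings used for the patched build in the measurement quoted earlier would look like this (an illustration only - it is assumed here that the PG_TRY_* variables are read from the postmaster's environment at server start) :

    # postgresql.conf
    effective_io_concurrency = 4
    max_async_io_prefetchers = 32

    # in the shell which starts the server, keeping the documented default
    export PG_TRY_PREFETCHING_FOR_ISCAN=256,N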
Usage :
______

There are no changes in usage other than as noted under Configuration and Statistics. However, in order to assess benefit from this feature, it will be useful to understand the query access plans of your workload using EXPLAIN. Before doing that, make sure that statistics are up to date using ANALYZE.

Internals:
__________

Internal changes span two areas and the interface between them :

  .  buffer manager layer
  .  programming interface for scanner to call buffer manager
  .  scanner layer

  buffer manager layer
  ____________________

  changes comprise :
    .  allocating, pinning , unpinning buffers
       this is complex and discussed briefly below in "Buffer Management"
    .  acquiring and releasing a BufferAiocb, the control block associated with a single aio_read, and checking for its completion
       a new file, backend/storage/buffer/buf_async.c, provides three new functions,
           BufStartAsync BufReleaseAsync BufCheckAsync
       which handle this.
    .  calling librt asynch io functions
       this follows the example of all other filesystem interfaces and is straightforward.
       two new functions are provided in fd.c :
           FileStartaio FileCompleteaio
       and corresponding interfaces in smgr.c

  programming interface for scanner to call buffer manager
  ________________________________________________________

    .  the calling interface for the existing function PrefetchBuffer is modified :
         .  one new argument, BufferAccessStrategy strategy
         .  it now returns an int return code which indicates :
              whether the pin count on the buffer has been increased by 1
              whether the block was already present in a buffer
    .  new function DiscardBuffer
         .  discards a buffer used for a previously prefetched page which the scanner decides it does not want to read.
         .  same arguments as for PrefetchBuffer, except for omission of the BufferAccessStrategy
         .  note - this is different from the existing function ReleaseBuffer, in that ReleaseBuffer takes a buffer_descriptor as argument for a buffer which has been read, but it has a similar purpose.

  scanner layer
  _____________

  common to all scanners is that a scanner which wishes to prefetch must do three things (a sketch follows below) :
    .  decide which pages to prefetch and call PrefetchBuffer to prefetch them
       nodeBitmapHeapscan already does this (but note the one extra argument on PrefetchBuffer)
    .  remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range), removing each page from this list if and when it subsequently reads the page
    .  at end of scan, call DiscardBuffer for every remembered (i.e. prefetched but unread) page

  How this list of prefetched pages is implemented varies for each of the three scanners and four scan types :
    .  bitmap index scan - heap pages
    .  non-bitmap (i.e. simple) index scans - index pages
    .  non-bitmap (i.e. simple) index scans - heap pages
    .  simple heap scans

  The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are :
    .  counted in aio_read_forgot (see "Statistics" above)
    .  may incur an annoying but harmless warning in the pg_log, "Buffer Leak ..." (the buffer is released at commit)
  This does sometimes happen ...
Buffer Management
_________________

With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward, but also every other part of the buffer manager must know about the possibility that a buffer may be in an async_io_in_progress state and be prepared to determine the possible completion. That is, one backend BK1 may start the io but another backend BK2 may try to read it before BK1 does. Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress, namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO are called as part of asynchronous prefetching, their role is limited to maintaining the buffer descriptor flags, and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in a separate set of shared control blocks, the BufferAiocb list - refer to include/storage/buf_internals.h. Checking asynchronous io status is handled in the BufCheckAsync function in backend/storage/buffer/buf_async.c. Read the commentary for this function for more details.

Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching, and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c. When a backend BK2 requests ReadBuffer of a page for which an asynch read is in progress, the buffer manager has to determine which backend BK1 pinned this buffer during the previous PrefetchBuffer, and, for example, the buffer must not be re-pinned a second time if BK2 is BK1. Information concerning which backend initiated the prefetch is held in the BufferAiocb.

The trickiest case concerns the scenario in which :
  .  BK1 initiates a prefetch and acquires a pin
  .  BK2 possibly waits for completion and then reads the buffer, and perhaps later on releases it by ReleaseBuffer.
  .  Since the asynchronous IO is no longer in progress, there is no longer any BufferAiocb associated with it. Yet the buffer manager must remember that BK1 holds a "prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
  .  The solution to this problem is to invent the concept of a "banked" pin, which is a pin obtained when the prefetch was issued, identified as in "banked" status only if and when the associated asynchronous IO terminates, and redeemable by the next use by the same task, either by ReadBuffer or DiscardBuffer. The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend) is stored in the buffer descriptor. This is done without increasing the size of the buffer descriptor, which is important since there may be a very large number of these. This does overload the relevant field in the descriptor. Refer to include/storage/buf_internals.h for more details, and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c.
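The banked-pin life cycle can be modelled roughly as follows (a toy model only - in the real patch this state is packed into existing buffer descriptor fields and the BufferAiocb, not into a struct like this) :

    #include <stdbool.h>

    typedef enum { PIN_NONE, PIN_PREFETCH, PIN_BANKED } ToyPinState;

    typedef struct
    {
        int         owner_pid;   /* backend which issued the prefetch */
        ToyPinState state;
    } ToyBufferPin;

    /* BK1 issues PrefetchBuffer : a pin is taken, tied to the aio */
    static void
    toy_prefetch(ToyBufferPin *p, int pid)
    {
        p->owner_pid = pid;
        p->state = PIN_PREFETCH;
    }

    /* the asynchronous read terminates (possibly observed by BK2) :
     * BK1's pin becomes "banked" - still held, but redeemable */
    static void
    toy_aio_complete(ToyBufferPin *p)
    {
        if (p->state == PIN_PREFETCH)
            p->state = PIN_BANKED;
    }

    /* BK1 later issues ReadBuffer or DiscardBuffer : the banked pin
     * is redeemed rather than a second pin being taken; any other
     * backend must pin normally */
    static bool
    toy_redeem(ToyBufferPin *p, int pid)
    {
        if (p->state == PIN_BANKED && p->owner_pid == pid)
        {
            p->state = PIN_NONE;
            return true;        /* reuse the banked pin */
        }
        return false;           /* normal pin required */
    }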
______________________________________________________________________________
The following 43 files are changed in this feature (output of the patch command) :

patching file configure.in
patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
patching file contrib/pg_stat_statements/Makefile
patching file contrib/pg_stat_statements/pg_stat_statements.c
patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
patching file config/c-library.m4
patching file src/backend/postmaster/postmaster.c
patching file src/backend/executor/nodeBitmapHeapscan.c
patching file src/backend/executor/nodeIndexscan.c
patching file src/backend/executor/instrument.c
patching file src/backend/storage/buffer/Makefile
patching file src/backend/storage/buffer/bufmgr.c
patching file src/backend/storage/buffer/buf_async.c
patching file src/backend/storage/buffer/buf_init.c
patching file src/backend/storage/smgr/md.c
patching file src/backend/storage/smgr/smgr.c
patching file src/backend/storage/file/fd.c
patching file src/backend/storage/lmgr/proc.c
patching file src/backend/access/heap/heapam.c
patching file src/backend/access/heap/syncscan.c
patching file src/backend/access/index/indexam.c
patching file src/backend/access/index/genam.c
patching file src/backend/access/nbtree/nbtsearch.c
patching file src/backend/access/nbtree/nbtinsert.c
patching file src/backend/access/nbtree/nbtpage.c
patching file src/backend/access/nbtree/nbtree.c
patching file src/backend/nodes/tidbitmap.c
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/mmgr/aset.c
patching file src/include/executor/instrument.h
patching file src/include/storage/bufmgr.h
patching file src/include/storage/smgr.h
patching file src/include/storage/fd.h
patching file src/include/storage/buf_internals.h
patching file src/include/catalog/pg_am.h
patching file src/include/catalog/pg_proc.h
patching file src/include/pg_config_manual.h
patching file src/include/access/nbtree.h
patching file src/include/access/heapam.h
patching file src/include/access/relscan.h
patching file src/include/nodes/tidbitmap.h
patching file src/include/utils/rel.h
patching file src/include/pg_config.h.in

Future Possibilities:
____________________

There are several possible extensions of this feature :
  .  Extend prefetching of index scans to types of index other than B-tree. This should be fairly straightforward, but requires some good base of benchmarkable workloads to prove the value.
  .  Investigate why asynchronous IO prefetching does not greatly improve sequential relation heap scans, and possibly find how to achieve a benefit.
  .  Build knowledge of asynchronous IO prefetching into the Query Planner costing. This is far from straightforward. The Postgresql Query Planner's costing model is based on resource consumption rather than elapsed time. Use of asynchronous IO prefetching is intended to improve elapsed time at the expense of (probably) higher resource consumption. Although Costing understands about the reduced cost of reading buffered blocks, it does not take asynchronicity or overlap of CPU with disk into account. A naive approach might be to try to tweak the Query Planner's Cost Constant configuration parameters, such as seq_page_cost and random_page_cost, but this is hazardous as explained in the Documentation.

John Lumby, johnlumby(at)hotmail(dot)com
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> Fair enough, but I don't think that affects correctness either way (I
>> don't think that you meant to imply that this was a necessary
>> precaution that you'd taken - right?).
>
> Right.

So, the comments within _bt_search() suggest that the _bt_moveright() call will perform a _bt_finish_split() call opportunistically iff it's called from _bt_doinsert() (i.e. access == BT_WRITE). There is no reason to not do so in all circumstances though, assuming that it's desirable to do so as soon as possible (which I *don't* actually assume). If I'm not mistaken, it's also true that it would be strictly speaking correct to never do it there.

Do you think it would be a fair stress-test if I was to hack Postgres so that this call never happens (within _bt_moveright())? I'd also have an incomplete page split occur at random with a probability of, say, 1% per split. The mechanism would be much the same as your original test-case for the split patch - I'd throw an error at the wrong place, although only 1% of the time, and over many hours. The concern with the _bt_moveright() call of _bt_finish_split() is that it might conceal a problem without reliably fixing it, potentially making isolating that problem much harder.
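The fault injection itself could be as trivial as something like the following, dropped into the page split code just before the downlink insertion (illustrative only) :

    /* simulate the first inserter's failure : leave ~1% of page
     * splits incomplete by erroring out before the downlink is
     * inserted into the parent */
    if (random() % 100 == 0)
        elog(ERROR, "fault injection: leaving page split incomplete");

-- 
Peter Geoghegan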
On 05/28/2014 02:15 AM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> Fair enough, but I don't think that affects correctness either way (I
>>> don't think that you meant to imply that this was a necessary
>>> precaution that you'd taken - right?).
>>
>> Right.
>
> So, the comments within _bt_search() suggest that the _bt_moveright()
> call will perform a _bt_finish_split() call opportunistically iff it's
> called from _bt_doinsert() (i.e. access == BT_WRITE). There is no
> reason to not do so in all circumstances though, assuming that it's
> desirable to do so as soon as possible (which I *don't* actually
> assume).

You can't write in a hot standby, so that's one reason to only do it when inserting, not when querying. Even when you're not in a hot standby, it seems like a nice property that a read-only query doesn't do writes. I know we do hint bit updates and other kinds of write-action when reading anyway, but still.

> If I'm not mistaken, it's also true that it would be strictly
> speaking correct to never do it there.

Hmm. No, it wouldn't. It is not allowed to have two incompletely-split pages adjacent to each other. If you move right to the right half of an incomplete split, i.e. a page that does not have a downlink in its parent, and then try to split the page again, _bt_insert_parent() would fail to find the location to insert the new downlink to. It assumes that there is a downlink to the page being split in the parent, and uses that to find the location for the new downlink.

- Heikki
On Tue, May 27, 2014 at 3:17 PM, John Lumby <johnlumby@hotmail.com> wrote:
> Below I am pasting the README we have written for this new functionality
> which mentions some of the measurements, advantages (and disadvantages)
> and we welcome all and any comments on this.

I think that this is likely to be a useful area to work on, but I wonder: can you suggest a useful test case or benchmark to show the advantages of the patch you posted? You mention a test case already, but it's a little short on details. I think it's always a good idea to start with that when pursuing a performance feature.

Have you thought about things like specialized prefetching for nested loop joins?
-- 
Peter Geoghegan
On Wed, May 28, 2014 at 6:51 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Have you thought about things like specialized prefetching for nested
> loop joins?

Currently, such a thing would need some non-trivial changes to the execution nodes, I believe. For nestloop, correct me if I'm wrong, but index scan nodes don't have visibility of the next tuple to be searched for.
On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> For nestloop, correct me if I'm wrong, but index scan nodes don't have
> visibility of the next tuple to be searched for.

Nested loop joins are considered a particularly compelling case for prefetching, actually.
-- 
Peter Geoghegan
I have pasted below the EXPLAIN of one of my benchmark queries
(the one I reference in the README).
Plenty of nested loop joins.
However I think I understand your question as to how effective it would be
if the outer is not sorted, and I will see if I can dig into that
if I get time (and it sounds as though Claudio is on it too).
The area of exactly what the best prefetch strategy should be for
each particular type of scan and context is a good one to work on.
John
> Date: Wed, 28 May 2014 18:12:23 -0700
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: pg@heroku.com
> To: klaussfreire@gmail.com
> CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>
> On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > For nestloop, correct me if I'm wrong, but index scan nodes don't have
> > visibility of the next tuple to be searched for.
>
> Nested loop joins are considered a particularly compelling case for
> prefetching, actually.
>
> --
> Peter Geoghegan
____________________________________________________________________________________-
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=801294.81..801294.81 rows=2 width=532)
CTE deploy_zone_down
-> Recursive Union (cost=1061.25..2687.40 rows=11 width=573)
-> Nested Loop (cost=1061.25..1423.74 rows=1 width=41)
-> Nested Loop (cost=1061.11..1421.22 rows=14 width=49)
-> Bitmap Heap Scan on entity zone_tree (cost=1060.67..1175.80 rows=29 width=40)
Recheck Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text) AND ((discriminator)::text = 'ZONE'::text))
-> BitmapAnd (cost=1060.67..1060.67 rows=29 width=0)
-> Bitmap Index Scan on entity_name (cost=0.00..139.71 rows=5927 width=0)
Index Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text))
-> Bitmap Index Scan on entity_discriminatorx (cost=0.00..920.70 rows=49636 width=0)
Index Cond: ((discriminator)::text = 'ZONE'::text)
-> Index Scan using metadata_value_owner_id on metadata_value mv (cost=0.43..8.45 rows=1 width=17)
Index Cond: (owner_id = zone_tree.id)
-> Index Scan using metadata_field_pkey on metadata_field mf (cost=0.15..0.17 rows=1 width=8)
Index Cond: (id = mv.field_id)
Filter: ((name)::text = 'deployable'::text)
-> Nested Loop (cost=0.87..126.34 rows=1 width=573)
-> Nested Loop (cost=0.72..125.44 rows=5 width=581)
-> Nested Loop (cost=0.29..83.42 rows=10 width=572)
-> WorkTable Scan on deploy_zone_down dzd (cost=0.00..0.20 rows=10 width=540)
-> Index Scan using entity_discriminator_parent_zone on entity ch (cost=0.29..8.31 rows=1 width=40)
Index Cond: ((parent_id = dzd.zone_tree_id) AND ((discriminator)::text = 'ZONE'::text))
-> Index Scan using metadata_value_owner_id on metadata_value mv_1 (cost=0.43..4.19 rows=1 width=17)
Index Cond: (owner_id = ch.id)
-> Index Scan using metadata_field_pkey on metadata_field mf_1 (cost=0.15..0.17 rows=1 width=8)
Index Cond: (id = mv_1.field_id)
Filter: ((name)::text = 'deployable'::text)
CTE deploy_zone_tree
-> Recursive Union (cost=0.00..9336.89 rows=21 width=1105)
-> CTE Scan on deploy_zone_down dzd_1 (cost=0.00..0.22 rows=11 width=1105)
-> Nested Loop (cost=0.43..933.63 rows=1 width=594)
-> WorkTable Scan on deploy_zone_tree dzt (cost=0.00..2.20 rows=110 width=581)
-> Index Scan using entity_pkey on entity pt (cost=0.43..8.46 rows=1 width=21)
Index Cond: (id = dzt.zone_tree_ancestor_parent_id)
Filter: ((discriminator)::text = ANY ('{VIEW,ZONE}'::text[]))
CTE forward_host_ip
-> Nested Loop (cost=1.30..149.65 rows=24 width=88)
-> Nested Loop (cost=0.87..121.69 rows=48 width=56)
-> Nested Loop (cost=0.43..71.61 rows=99 width=48)
-> CTE Scan on deploy_zone_tree dzt_1 (cost=0.00..0.47 rows=1 width=16)
Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
-> Index Scan using entity_parent_id on entity host (cost=0.43..70.14 rows=99 width=40)
Index Cond: (parent_id = dzt_1.zone_tree_id)
Filter: ((discriminator)::text = 'HOST'::text)
-> Index Scan using entity_link_owner_id on entity_link link (cost=0.43..0.50 rows=1 width=16)
Index Cond: (owner_id = host.id)
Filter: ((link_type)::text = ANY ('{IP,IP6}'::text[]))
-> Index Scan using entity_pkey on entity address (cost=0.43..0.57 rows=1 width=40)
Index Cond: (id = link.entity_id)
Filter: ((discriminator)::text = ANY ('{IP4A,IP6A}'::text[]))
CTE association_view
-> Nested Loop (cost=0.87..26.29 rows=1 width=75)
-> Nested Loop (cost=0.43..17.82 rows=1 width=56)
-> CTE Scan on deploy_zone_tree dzt_2 (cost=0.00..0.47 rows=1 width=16)
Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
-> Index Scan using entity_discriminator_parent_rr on entity record (cost=0.43..17.34 rows=1 width=48)
Index Cond: ((parent_id = dzt_2.zone_tree_id) AND ((discriminator)::text = ANY ('{C,MX,SRV}'::text[])))
-> Index Scan using entity_pkey on entity assoc (cost=0.43..8.46 rows=1 width=27)
Index Cond: (id = record.association_id)
CTE simple_view
-> Nested Loop (cost=0.43..22.27 rows=1 width=48)
-> CTE Scan on deploy_zone_tree dzt_3 (cost=0.00..0.47 rows=1 width=16)
Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
-> Index Scan using entity_discriminator_parent_rr on entity record_1 (cost=0.43..21.79 rows=1 width=40)
Index Cond: ((parent_id = dzt_3.zone_tree_id) AND ((discriminator)::text = ANY ('{TXT,HINFO,GENRR,NAPTR}'::text[])))
CTE max_hist_id
-> Result (cost=0.48..0.49 rows=1 width=0)
InitPlan 6 (returns $19)
-> Limit (cost=0.43..0.48 rows=1 width=8)
-> Index Only Scan Backward using entity_history_history_id on entity_history xh (cost=0.43..444052.51 rows=10386347 width=8)
Index Cond: (history_id IS NOT NULL)
CTE relevant_history
-> Nested Loop (cost=0.43..199661.39 rows=3438689 width=28)
-> CTE Scan on max_hist_id xh_1 (cost=0.00..0.02 rows=1 width=8)
-> Index Scan using entity_history_history_id on entity_history eh (cost=0.43..156677.76 rows=3438689 width=20)
Index Cond: (history_id > xh_1.history_id)
Filter: (transaction_type = 'I'::bpchar)
CTE resource_records
-> Unique (cost=580178.30..584992.46 rows=160472 width=1063)
-> Sort (cost=580178.30..580579.48 rows=160472 width=1063)
Sort Key: fip.host_id, fip.host_discriminator, fip.host_parent_id, fip.view_id, ((fip.address_id)::text), fip.host_name, ((fip.address_long1)::text), (((fip.host_long1 & 1::bigint))::text), ((fip.address_parent_id)::text), ((((''::text || (COALESCE(mv_2.longnumber, (-1)::bigint))::text) || ','::text) || (fip.address_discriminator)::text)), rh.long1
-> Append (cost=203.82..417112.92 rows=160472 width=1063)
-> Hash Join (cost=203.82..91844.90 rows=137548 width=1136)
Hash Cond: ((rh.hist_discrim)::text = (fip.address_discriminator)::text)
Join Filter: (rh.history_delta > fip.host_id)
-> CTE Scan on relevant_history rh (cost=0.00..68773.78 rows=3438689 width=532)
-> Hash (cost=203.52..203.52 rows=24 width=1128)
-> Nested Loop Left Join (cost=0.43..203.52 rows=24 width=1128)
-> CTE Scan on forward_host_ip fip (cost=0.00..0.48 rows=24 width=1120)
-> Index Scan using metadata_value_owner_id on metadata_value mv_2 (cost=0.43..8.45 rows=1 width=24)
Index Cond: (owner_id = fip.host_id)
-> Nested Loop (cost=0.43..77722.85 rows=5731 width=644)
Join Filter: (rh_1.history_delta > av.record_id)
-> Nested Loop Left Join (cost=0.43..8.48 rows=1 width=636)
-> CTE Scan on association_view av (cost=0.00..0.02 rows=1 width=628)
Filter: ((record_discriminator)::text = 'C'::text)
-> Index Scan using metadata_value_owner_id on metadata_value mv_3 (cost=0.43..8.45 rows=1 width=24)
Index Cond: (owner_id = av.record_id)
-> CTE Scan on relevant_history rh_1 (cost=0.00..77370.50 rows=17193 width=532)
Filter: ((hist_discrim)::text = 'C'::text)
-> Hash Join (cost=0.04..83402.53 rows=5731 width=636)
Hash Cond: ((rh_2.hist_discrim)::text = (av_1.record_discriminator)::text)
Join Filter: (rh_2.history_delta > av_1.record_id)
-> CTE Scan on relevant_history rh_2 (cost=0.00..68773.78 rows=3438689 width=532)
-> Hash (cost=0.02..0.02 rows=1 width=628)
-> CTE Scan on association_view av_1 (cost=0.00..0.02 rows=1 width=628)
Filter: ((record_discriminator)::text = ANY ('{MX,SRV}'::text[]))
-> Nested Loop (cost=0.86..79164.06 rows=5731 width=618)
Join Filter: (rh_3.history_delta > sv.record_id)
-> Nested Loop Left Join (cost=0.86..16.94 rows=1 width=610)
-> Nested Loop Left Join (cost=0.43..8.48 rows=1 width=588)
-> CTE Scan on simple_view sv (cost=0.00..0.02 rows=1 width=580)
Filter: ((record_discriminator)::text = 'TXT'::text)
-> Index Scan using metadata_value_owner_id on metadata_value mv_4 (cost=0.43..8.45 rows=1 width=24)
Index Cond: (owner_id = sv.record_id)
-> Index Scan using metadata_value_owner_id on metadata_value txtvalue (cost=0.43..8.45 rows=1 width=38)
Index Cond: (owner_id = sv.record_id)
-> CTE Scan on relevant_history rh_3 (cost=0.00..77370.50 rows=17193 width=532)
Filter: ((hist_discrim)::text = 'TXT'::text)
-> Hash Join (cost=0.04..83373.87 rows=5731 width=588)
Hash Cond: ((rh_4.hist_discrim)::text = (sv_1.record_discriminator)::text)
Join Filter: (rh_4.history_delta > sv_1.record_id)
-> CTE Scan on relevant_history rh_4 (cost=0.00..68773.78 rows=3438689 width=532)
-> Hash (cost=0.02..0.02 rows=1 width=580)
-> CTE Scan on simple_view sv_1 (cost=0.00..0.02 rows=1 width=580)
Filter: ((record_discriminator)::text = ANY ('{HINFO,GENRR,NAPTR}'::text[]))
-> Sort (cost=4417.98..4418.48 rows=200 width=532)
Sort Key: resource_records.discrim
-> HashAggregate (cost=4412.98..4415.98 rows=200 width=532)
Group Key: resource_records.discrim
-> CTE Scan on resource_records (cost=0.00..3209.44 rows=160472 width=532)
Planning time: 6.620 ms
(133 rows)
On 28/05/2014 22:12, "Peter Geoghegan" <pg@heroku.com> wrote:
>
> On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > For nestloop, correct me if I'm wrong, but index scan nodes don't have
> > visibility of the next tuple to be searched for.
>
> Nested loop joins are considered a particularly compelling case for
> prefetching, actually.

Of course. I only doubt it can be implemented without not-so-small changes to all execution nodes.

I'll look into it.

>
> --
> Peter Geoghegan