Thread: Re: [PATCH] Prefetch index pages for B-Tree index scans
> From: Claudio Freire <klaussfreire@gmail.com>
> I hope I'm not talking to myself.
Indeed not. I also looked into prefetching for pure index scans for b-trees (and extension to use async io).
http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl
I am not where I have a proper setup this week but will reply at greater length next week.
John
On Tue, Oct 23, 2012 at 9:44 AM, John Lumby <johnlumby@hotmail.com> wrote:
>> From: Claudio Freire <klaussfreire@gmail.com>
>> I hope I'm not talking to myself.
>
> Indeed not. I also looked into prefetching for pure index scans for
> b-trees (and extension to use async io).
> http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl

Yes, I've seen that, though I thought it was only an improvement on
PrefetchBuffer. That patch would interact quite nicely with mine.

I'm now trying to prefetch heap tuples, and I got to a really nice place
where I get an extra 30% speedup even on forward scans, but the patch is
rather green now for a review.

> I am not where I have a proper setup this week but will reply at greater
> length next week.

Great - will go on improving the patch in the meanwhile ;-)
On Tue, Oct 23, 2012 at 10:54 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Indeed not. I also looked into prefetching for pure index scans for
>> b-trees (and extension to use async io).
>> http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl
>
> Yes, I've seen that, though I thought it was only an improvement on
> PrefetchBuffer. That patch would interact quite nicely with mine.
>
> I'm now trying to prefetch heap tuples, and I got to a really nice
> place where I get an extra 30% speedup even on forward scans, but the
> patch is rather green now for a review.
>
>> I am not where I have a proper setup this week but will reply at greater
>> length next week.
>
> Great - will go on improving the patch in the meanwhile ;-)

Ok, this is the best I could come up with, without some real test hardware.

The only improvement I see in single-disk scenarios:

* Huge speedup of back-sequential index-only scans
* Marginal speedup on forward index-only scans (5% or less)
* No discernible difference in heap-including scans (even with heap
  prefetch), but I'm pretty sure a real RAID setup would change this
* No change in pgbench (so I guess no regression for small transactions)

If I manage to get my hands on test hardware, I'll post results. But we
just had to bring some machines offline in our testing datacenter, which
effectively shrank my options rather than expanding them. I don't see that
improving soon, so I'll post the patch and hope someone else tests.

PS: should I add it to the commit fest? Should we compare notes with John
Lumby's patch first?
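A minimal sketch of the idea being discussed, for the forward-scan case:
while stepping through a b-tree leaf page, hint the right sibling so the
next page's I/O can overlap with processing the current one. This is not
lifted from the attached patch; rel and page are placeholders, and a
backward scan would use btpo_prev instead of btpo_next.

    #include "access/nbtree.h"
    #include "storage/bufmgr.h"

    /* Sketch only: after reading a leaf page during a forward scan, hint
     * the right sibling so the kernel can start fetching it.  The real
     * patch also decides when prefetching is worthwhile at all. */
    static void
    prefetch_next_leaf(Relation rel, Page page)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        if (!P_RIGHTMOST(opaque))
            PrefetchBuffer(rel, MAIN_FORKNUM, opaque->btpo_next);
    }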
On Mon, Oct 29, 2012 at 12:53 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Yes, I've seen that, though I thought it was only an improvement on
>> PrefetchBuffer. That patch would interact quite nicely with mine.
>>
>> I'm now trying to prefetch heap tuples, and I got to a really nice
>> place where I get an extra 30% speedup even on forward scans, but the
>> patch is rather green now for a review.
>>
>>> I am not where I have a proper setup this week but will reply at greater
>>> length next week.
>>
>> Great - will go on improving the patch in the meanwhile ;-)
>
> Ok, this is the best I could come up with, without some real test hardware.

Oops - forgot to effectively attach the patch.
Attachment
> Ok, this is the best I could come up with, without some real test hardware.
>
> The only improvement I see in single-disk scenarios:
> * Huge speedup of back-sequential index-only scans
> * Marginal speedup on forward index-only scans (5% or less)
> * No discernible difference in heap-including scans (even with heap
> prefetch), but I'm pretty sure a real RAID setup would change this
> * No change in pgbench (so I guess no regression for small transactions)

If the gain is visible mostly for backward scans and not for other access
patterns, I suggest checking the work done on backward prefetching in linux.

http://thread.gmane.org/gmane.linux.kernel.mm/73837 for example

I don't know how others (BSD, windows, ...) handle this case.

Maybe the strategy of using our own prefetch is better; then I would like to
use it also in places where we used to hack to make linux understand that we
will benefit from prefetching.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On Mon, Oct 29, 2012 at 4:17 PM, Cédric Villemain <cedric@2ndquadrant.com> wrote:
>> Ok, this is the best I could come up with, without some real test hardware.
>>
>> The only improvement I see in single-disk scenarios:
>> * Huge speedup of back-sequential index-only scans
>> * Marginal speedup on forward index-only scans (5% or less)
>> * No discernible difference in heap-including scans (even with heap
>> prefetch), but I'm pretty sure a real RAID setup would change this
>> * No change in pgbench (so I guess no regression for small transactions)
>
> If the gain is visible mostly for backward scans and not for other access
> patterns, I suggest checking the work done on backward prefetching in linux.
>
> http://thread.gmane.org/gmane.linux.kernel.mm/73837 for example

That patch seems very similar (but in kernel terms) to this patch, so I
imagine it would also do the job.

But it also looks forgotten. Bringing it back to life would mean building
the latest kernel with that patch included, replicating the benchmarks I
ran here, sans pg patch, but with patched kernel, and reporting the
(hopefully equally dramatic) performance improvements in the kernel ML.
That would take me quite some time (not used to playing with kernels,
though it wouldn't be my first time either), though it might be worth the
effort.

> I don't know how others (BSD, windows, ...) handle this case.

I don't even think windows supports posix_fadvise, but if async_io is used
(as hinted by the link Lumby posted), it would probably also work in
windows.

BSD probably supports it the same way linux does.

> Maybe the strategy of using our own prefetch is better; then I would like
> to use it also in places where we used to hack to make linux understand
> that we will benefit from prefetching.

It would at least benefit those installations without the
latest-in-the-future kernel-with-backwards-readahead.

Which places are you referring to, btw?
> But it also looks forgotten. Bringing it back to life would mean
> building the latest kernel with that patch included, replicating the
> benchmarks I ran here, sans pg patch, but with patched kernel, and
> reporting the (hopefully equally dramatic) performance improvements in
> the kernel ML. That would take me quite some time (not used to playing
> with kernels, though it wouldn't be my first time either), though it
> might be worth the effort.

Well, informing linux hackers may help.

> > I don't know how others (BSD, windows, ...) handle this case.
>
> I don't even think windows supports posix_fadvise, but if async_io is
> used (as hinted by the link Lumby posted), it would probably also work
> in windows.
>
> BSD probably supports it the same way linux does.

I thought of it the opposite way: how do other kernels handle backwards
prefetch?

> > Maybe the strategy of using our own prefetch is better; then I would like
> > to use it also in places where we used to hack to make linux understand
> > that we will benefit from prefetching.
>
> It would at least benefit those installations without the
> latest-in-the-future kernel-with-backwards-readahead.

We're speaking of PostgreSQL 9.3, running cutting-edge PostgreSQL on an old
kernel at the end of 2013... Maybe it won't be so latest-in-the-future by
that time.

Btw the improvements you are doing look good; I'm just adding some
information about what has been achieved around us.

> Which places are you referring to, btw?

Maintenance tasks.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain <cedric@2ndquadrant.com> wrote:
>> But it also looks forgotten. Bringing it back to life would mean
>> building the latest kernel with that patch included, replicating the
>> benchmarks I ran here, sans pg patch, but with patched kernel, and
>> reporting the (hopefully equally dramatic) performance improvements in
>> the kernel ML. That would take me quite some time (not used to playing
>> with kernels, though it wouldn't be my first time either), though it
>> might be worth the effort.
>
> Well, informing linux hackers may help.

I agree, though I'm a bit hesitant to subscribe to yet another mailing list.

>> > I don't know how others (BSD, windows, ...) handle this case.
>>
>> I don't even think windows supports posix_fadvise, but if async_io is
>> used (as hinted by the link Lumby posted), it would probably also work
>> in windows.
>>
>> BSD probably supports it the same way linux does.
>
> I thought of it the opposite way: how do other kernels handle backwards
> prefetch?

From what I saw (while researching that statement above), BSD's read-ahead
and fadvise implementations are way behind linux's. Functional, though. I
haven't been able to find the code responsible for readahead in FreeBSD yet
to confirm whether they have anything supporting back-sequential patterns.

>> > Maybe the strategy of using our own prefetch is better; then I would like
>> > to use it also in places where we used to hack to make linux understand
>> > that we will benefit from prefetching.
>>
>> It would at least benefit those installations without the
>> latest-in-the-future kernel-with-backwards-readahead.
>
> We're speaking of PostgreSQL 9.3, running cutting-edge PostgreSQL on an old
> kernel at the end of 2013... Maybe it won't be so latest-in-the-future by
> that time.

Good point.
Claudio wrote :
>
> Oops - forgot to effectively attach the patch.
>

I've read through your patch and the earlier posts by you and Cédric.

This is very interesting.  You chose to prefetch index btree (key-ptr) pages,
whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
Never mind why -- I think they should work very well together, as both have
been demonstrated to produce improvements.  I will see if I can combine them,
git permitting (as of course their changed file lists overlap).

I was surprised by this design decision :
/* start prefetch on next page, but not if we're reading sequentially already, as it's counterproductive in those cases */
Is it really?  Are you assuming that it's redundant with posix_fadvise for this case?
I think possibly when async_io is also in use by the postgresql prefetcher,
this decision could change.

Cédric wrote:
> If the gain is visible mostly for the backward and not for other access
> > building the latest kernel with that patch included, replicating the

I found improvement from forward scans.  Actually I did not even try
backward, but only because I did not have time.  It should help both.

>> I don't even think windows supports posix_fadvise, but if async_io is
>> used (as hinted by the link Lumby posted), it would probably also work
>> in windows.

windows has async io and I think it would not be hard to extend my
implementation to windows (although I don't plan to do it myself).

Actually about 95% of the code I wrote to implement async-io in postgresql
concerns not the async io, which is trivial, but the buffer management.
With async io, PrefetchBuffer must allocate and pin a buffer (not too hard),
but now also every other part of buf mgr must know about the possibility
that a buffer may be in async_io_in_progress state and be prepared to
determine the possible completion (quite hard) - and also, if and when the
prefetch requester comes again with ReadBuffer, buf mgr has to remember that
this buffer was pinned by this backend during the previous prefetch and must
not be re-pinned a second time (very hard without increasing the size of the
shared descriptor, which was important since there could be a very large
number of these).  It ended up with a major change to bufmgr.c plus one new
file for handling the buffer management aspects of starting, checking and
terminating async io.

However I think in some environments the async-io has significant benefits
over posix_fadvise, especially (of course!) where access is very
non-sequential, but also for sequential access if there are many concurrent
conflicting sets of sequential command streams from different backends
(always assuming the RAID can manage them concurrently).

I've attached a snapshot patch of just the non-bitmap-index-scan changes
I've made.  You can't compile it as is, because I had to change the
interface to PrefetchBuffer and add a new DiscardBuffer, which I did not
include in this snapshot to avoid confusion.

John
Attachment
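For reference, the part described as trivial above can be as small as the
sketch below: kicking off one asynchronous block read with POSIX AIO
(librt), assuming the buffer has already been allocated and pinned.  The
function name, the fd argument and the fixed 8192-byte block size are
illustrative, not taken from the attached patch.

    #include <aio.h>
    #include <string.h>

    /* Rough sketch: start an asynchronous read of one 8kB block into an
     * already-pinned shared buffer.  The real design keeps the aiocb in a
     * shared control block so other backends can check on it later. */
    static int
    start_prefetch_read(int fd, off_t blocknum, void *buf, struct aiocb *cb)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf    = buf;
        cb->aio_nbytes = 8192;              /* BLCKSZ */
        cb->aio_offset = blocknum * 8192;

        return aio_read(cb);                /* 0 if queued, -1 on error */
    }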
On Tue, Oct 30, 2012 at 1:50 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
> <cedric@2ndquadrant.com> wrote:
>> Well, informing linux hackers may help.
>
> I agree. I'm a bit hesitant to subscribe to yet another mailing list

FYI you can send messages to linux-kernel without subscribing (there's
no moderation either).

Subscribing to linux-kernel is like drinking from a firehose :)

Regards,
Marti
On Thursday, November 01, 2012 05:53:39 PM Marti Raudsepp wrote:
> On Tue, Oct 30, 2012 at 1:50 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
> > <cedric@2ndquadrant.com> wrote:
> >> Well, informing linux hackers may help.
> >
> > I agree. I'm a bit hesitant to subscribe to yet another mailing list
>
> FYI you can send messages to linux-kernel without subscribing (there's
> no moderation either).
>
> Subscribing to linux-kernel is like drinking from a firehose :)

linux-fsdevel is more reasonable though...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 1, 2012 at 1:37 PM, John Lumby <johnlumby@hotmail.com> wrote:
>
> Claudio wrote :
>>
>> Oops - forgot to effectively attach the patch.
>>
>
> I've read through your patch and the earlier posts by you and Cédric.
>
> This is very interesting.  You chose to prefetch index btree (key-ptr) pages,
> whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
> Never mind why -- I think they should work very well together, as both have
> been demonstrated to produce improvements.  I will see if I can combine them,
> git permitting (as of course their changed file lists overlap).

Check the latest patch, it contains heap page prefetching too.

> I was surprised by this design decision :
> /* start prefetch on next page, but not if we're reading sequentially already, as it's counterproductive in those cases */
> Is it really?  Are you assuming that it's redundant with posix_fadvise for this case?
> I think possibly when async_io is also in use by the postgresql prefetcher,
> this decision could change.

async_io indeed may make that logic obsolete, but it's not redundancy with
posix_fadvise that's the trouble there; it's the fact that the kernel stops
doing read-ahead when a call to posix_fadvise comes.  I noticed the
performance hit, and checked the kernel's code.  It effectively changes the
prediction mode from sequential to fadvise, negating the (assumed) kernel's
prefetch logic.

> However I think in some environments the async-io has significant benefits
> over posix_fadvise, especially (of course!) where access is very
> non-sequential, but also for sequential access if there are many concurrent
> conflicting sets of sequential command streams from different backends
> (always assuming the RAID can manage them concurrently).

I've mused about the possibility of batching async_io requests, and using
the scatter/gather API instead of sending tons of requests to the kernel.
I think doing so would enable a zero-copy path that could very possibly
imply big speed improvements when memory bandwidth is the bottleneck.
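For context, the hint under discussion boils down to something like the
sketch below; PostgreSQL routes it through PrefetchBuffer and the storage
manager rather than calling it directly, and the fd and 8192-byte block
size here are placeholders.  It is this POSIX_FADV_WILLNEED call whose
kernel-side handling switches the readahead prediction mode described
above.

    #include <fcntl.h>

    /* Sketch of the advisory hint: tell the kernel a run of blocks will be
     * needed soon.  8192 stands in for BLCKSZ; error handling is omitted
     * because the hint is purely advisory. */
    static void
    hint_blocks(int fd, off_t first_block, int nblocks)
    {
        (void) posix_fadvise(fd,
                             first_block * 8192,
                             (off_t) nblocks * 8192,
                             POSIX_FADV_WILLNEED);
    }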
On Thu, Nov 1, 2012 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > I agree. I'm a bit hesitant to subscribe to yet another mailing list
>>
>> FYI you can send messages to linux-kernel without subscribing (there's
>> no moderation either).
>>
>> Subscribing to linux-kernel is like drinking from a firehose :)
>
> linux-fsdevel is more reasonable though...

readahead logic is not at the filesystem level, but the memory mapper's.

I'll consider posting without subscribing.
Claudio wrote :
>
> Check the latest patch, it contains heap page prefetching too.
>

Oh yes, I see.  I missed that - I was looking in the wrong place.

I do have one question about the way you did it : by placing the prefetch
heap-page calls in _bt_next, which effectively means inside a call from the
index am index_getnext_tid to btgettuple, are you sure you are synchronizing
your prefetches of heap pages with the index am's ReadBuffer's of heap pages?
I.e. are you complying with this comment from nodeBitmapHeapscan.c for
prefetching its bitmap heap pages in the bitmap-index-scan case:

 * We issue prefetch requests *after* fetching the current page to try
 * to avoid having prefetching interfere with the main I/O.

I can't really tell whether your design conforms to this, nor do I know
whether it is important, but I decided to do it in the same manner and so
implemented the heap-page prefetching in index_fetch_heap.

>
> async_io indeed may make that logic obsolete, but it's not redundancy with
> posix_fadvise that's the trouble there; it's the fact that the kernel stops
> doing read-ahead when a call to posix_fadvise comes.  I noticed the
> performance hit, and checked the kernel's code.  It effectively changes the
> prediction mode from sequential to fadvise, negating the (assumed) kernel's
> prefetch logic.
>

I did not know that.  Very interesting.

>
> I've mused about the possibility of batching async_io requests, and using
> the scatter/gather API instead of sending tons of requests to the kernel.
> I think doing so would enable a zero-copy path that could very possibly
> imply big speed improvements when memory bandwidth is the bottleneck.

I think you are totally correct on this point.  If I recall correctly, the
glibc (librt) aio does have an lio_listio, but it is either a noop or just
loops over the list, I forget which (I don't have its source right now); in
any case I am sure there is potential for implementing such a facility.
But to be really effective it should be implemented in the kernel itself,
which we don't have today.

John
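A sketch of the batch submission being discussed, using the POSIX
lio_listio interface mentioned above; whether glibc's librt turns this into
anything smarter than a loop of per-request reads is exactly the open
question here.  Each control block is assumed to have been filled in as for
a normal aio_read call.

    #include <aio.h>

    /* Sketch: queue a whole batch of prefetch reads in one call.  Each cb
     * must already carry its fd, buffer, offset and length, plus
     * aio_lio_opcode = LIO_READ.  LIO_NOWAIT returns without waiting for
     * any of the reads to finish. */
    static int
    submit_prefetch_batch(struct aiocb *const cbs[], int n)
    {
        return lio_listio(LIO_NOWAIT, cbs, n, NULL);
    }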
On 11/1/12 6:13 PM, Claudio Freire wrote:
> it's not redundancy with posix_fadvise that's the trouble there; it's
> the fact that the kernel stops doing read-ahead when a call to
> posix_fadvise comes.  I noticed the performance hit, and checked the
> kernel's code.  It effectively changes the prediction mode from
> sequential to fadvise, negating the (assumed) kernel's prefetch logic.

That's really interesting.  There was a patch submitted at one point to use
POSIX_FADV_SEQUENTIAL on sequential scans, and that wasn't a repeatable
improvement either, so it was canned at
http://archives.postgresql.org/pgsql-hackers/2008-10/msg01611.php

The Linux posix_fadvise implementation never seemed like it was well liked
by the kernel developers.  Quirky stuff like this popped up all the time
during that period, when effective_io_concurrency was being added.  I
wonder how far back the fadvise/read-ahead conflict goes.

> I've mused about the possibility of batching async_io requests, and using
> the scatter/gather API instead of sending tons of requests to the kernel.
> I think doing so would enable a zero-copy path that could very possibly
> imply big speed improvements when memory bandwidth is the bottleneck.

Another possibly useful bit of history here for you.  Greg Stark wrote a
test program that used async I/O effectively on both Linux and Solaris.
Unfortunately, it was hard to get that to work given how Postgres does its
buffer I/O, and using processes instead of threads.  This looks like the
place he commented on why:

http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html

The part I think was relevant there from him:

"In the libaio view of the world you initiate io and either get a
callback or call another syscall to test if it's complete.  Either
approach has problems for Postgres.  If the process that initiated io
is in the middle of a long query it might take a long time, or not even
never get back to complete the io.  The callbacks use threads...

And polling for completion has the problem that another process could
be waiting on the io and can't issue a read as long as the first
process has the buffer locked and io in progress.  I think aio makes a
lot more sense if you're using threads so you can start a thread to
wait for the io to complete."

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 11/1/12 6:13 PM, Claudio Freire wrote:
>
>> it's not redundancy with posix_fadvise that's the trouble there; it's
>> the fact that the kernel stops doing read-ahead when a call to
>> posix_fadvise comes.  I noticed the performance hit, and checked the
>> kernel's code.  It effectively changes the prediction mode from
>> sequential to fadvise, negating the (assumed) kernel's prefetch logic.
>
> ...
>
> The Linux posix_fadvise implementation never seemed like it was well liked
> by the kernel developers.  Quirky stuff like this popped up all the time
> during that period, when effective_io_concurrency was being added.  I
> wonder how far back the fadvise/read-ahead conflict goes.

Well, to be precise, it's not so much a problem in posix_fadvise itself as
a problem in how it interacts with readahead.  Since readahead works at
the memory mapper level, and only when actually performing I/O (which
would seem at first glance quite sensible), it doesn't get to see fadvise
activity.

FADV_WILLNEED is implemented as a forced readahead, which doesn't update
any of the readahead context structures.  Again, at first glance, this
would seem sensible (explicit hints shouldn't interfere with pattern
detection logic).  However, since those pages are (after the fadvise call)
under async I/O, the next time the memory mapper needs that page, instead
of requesting I/O through readahead logic, it will wait for async I/O to
complete.

IOW, what was in fact sequential became invisible to readahead,
indistinguishable from random I/O.  Whatever page fadvise failed to
predict will be treated as random I/O, and here the trouble lies.

>> I've mused about the possibility of batching async_io requests, and using
>> the scatter/gather API instead of sending tons of requests to the kernel.
>> I think doing so would enable a zero-copy path that could very possibly
>> imply big speed improvements when memory bandwidth is the bottleneck.
>
> Another possibly useful bit of history here for you.  Greg Stark wrote a
> test program that used async I/O effectively on both Linux and Solaris.
> Unfortunately, it was hard to get that to work given how Postgres does its
> buffer I/O, and using processes instead of threads.  This looks like the
> place he commented on why:
>
> http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html
>
> The part I think was relevant there from him:
>
> "In the libaio view of the world you initiate io and either get a
> callback or call another syscall to test if it's complete.  Either
> approach has problems for Postgres.  If the process that initiated io
> is in the middle of a long query it might take a long time, or not even
> never get back to complete the io.  The callbacks use threads...
>
> And polling for completion has the problem that another process could
> be waiting on the io and can't issue a read as long as the first
> process has the buffer locked and io in progress.  I think aio makes a
> lot more sense if you're using threads so you can start a thread to
> wait for the io to complete."

I noticed that.  I always envisioned async I/O as managed by some dedicated
process, one that could check for completion or receive callbacks.
Postmaster, for instance.
Claudio Freire wrote:
>
> On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> > On 11/1/12 6:13 PM, Claudio Freire wrote:
> >
> >> it's not redundancy with posix_fadvise that's the trouble there; it's
> >> the fact that the kernel stops doing read-ahead when a call to
> >> posix_fadvise comes.  I noticed the performance hit, and checked the
> >> kernel's code.  It effectively changes the prediction mode from
> >> sequential to fadvise, negating the (assumed) kernel's prefetch logic.
> >
> > The Linux posix_fadvise implementation never seemed like it was well liked
> > by the kernel developers.  Quirky stuff like this popped up all the time
> > during that period, when effective_io_concurrency was being added.  I
> > wonder how far back the fadvise/read-ahead conflict goes.
>
> Well, to be precise, it's not so much a problem in posix_fadvise itself as
> a problem in how it interacts with readahead.  Since readahead works at
> the memory mapper level, and only when actually performing I/O (which
> would seem at first glance quite sensible), it doesn't get to see fadvise
> activity.
>
> FADV_WILLNEED is implemented as a forced readahead, which doesn't update
> any of the readahead context structures.  Again, at first glance, this
> would seem sensible (explicit hints shouldn't interfere with pattern
> detection logic).  However, since those pages are (after the fadvise call)
> under async I/O, the next time the memory mapper needs that page, instead
> of requesting I/O through readahead logic, it will wait for async I/O to
> complete.
>
> IOW, what was in fact sequential became invisible to readahead,
> indistinguishable from random I/O.  Whatever page fadvise failed to
> predict will be treated as random I/O, and here the trouble lies.

And this may be one other advantage of async io over posix_fadvise in the
linux environment (with the present mmap behaviour) : that async io achieves
the same objective of improving disk/processing overlap without the
mentioned interference with read-ahead.  Although to confirm this would
ideally require a 3-way comparison of :

  posix_fadvise + existing readahead behaviour
  posix_fadvise + existing readahead behaviour modified to not force waiting
      for the current async io (i.e. just check the aio and continue normal
      readahead if in progress)
  async io with no posix_fadvise

It seems in general to be preferable to have an implementation that is less
dependent on specific behaviour of the OS read-ahead mechanism.

> >> I've mused about the possibility of batching async_io requests, and
> >> using the scatter/gather API instead of sending tons of requests to the
> >> kernel.  I think doing so would enable a zero-copy path that could very
> >> possibly imply big speed improvements when memory bandwidth is the
> >> bottleneck.
> >
> > Another possibly useful bit of history here for you.  Greg Stark wrote a
> > test program that used async I/O effectively on both Linux and Solaris.
> > Unfortunately, it was hard to get that to work given how Postgres does its
> > buffer I/O, and using processes instead of threads.  This looks like the
> > place he commented on why:
> >
> > http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html
> >
> > The part I think was relevant there from him:
> >
> > "In the libaio view of the world you initiate io and either get a
> > callback or call another syscall to test if it's complete.  Either
> > approach has problems for Postgres.  If the process that initiated io
> > is in the middle of a long query it might take a long time, or not even
> > never get back to complete the io.  The callbacks use threads...
> >
> > And polling for completion has the problem that another process could
> > be waiting on the io and can't issue a read as long as the first
> > process has the buffer locked and io in progress.  I think aio makes a
> > lot more sense if you're using threads so you can start a thread to
> > wait for the io to complete."
>
> I noticed that.  I always envisioned async I/O as managed by some
> dedicated process, one that could check for completion or receive
> callbacks.  Postmaster, for instance.

Thanks for mentioning this posting.  Interesting.
However, the OP describes an implementation based on libaio.
Today what we have (for linux) is librt, which is quite different.
It is arguably worse than libaio (well, actually I am sure it is worse)
since it is essentially just an encapsulation of using threads to do
synchronous ios - you can look at it as making it easier to do what the
application could do itself if it set up its own pthreads.  The linux
kernel does not know about it, and so the CPU overhead of checking for
completion is higher.

But if async io is used *only* for prefetching, and not for the actual
ReadBuffer itself (which is what I did), then the problem mentioned by the
OP, "If the process that initiated io is in the middle of a long query it
might take a long time, or never get back to complete the io", is not a
problem.  The model is simple:

1.  backend process P1 requests prefetch on a relation/block R/B, which
    results in initiating aio_read using a (new) shared control block (call
    it pg_buf_aiocb) which tracks the request and also contains librt's
    aiocb.
2.  any backend P2 (which may or may not be P1) issues ReadBuffer or
    similar, requesting access for read/pin to the buffer containing R/B.
    This backend discovers that the buffer is aio_in_progress and calls
    check_completion(pg_buf_aiocb), and waits (effectively on the librt
    thread) if not complete.
3.  any number of other backends may concurrently do the same as 2.  I.e.
    if the pg_buf_aiocb is still aio-in-progress, they all also wait on the
    librt thread.
4.  Each waiting backend receives the completion, and the last one does the
    housekeeping and returns the pg_buf_aiocb.

What complicates it is managing the associated pinned buffer in such a way
that every backend takes the correct action with the correct degree of
serialization of the buffer descriptor during critical sections, while
still allowing all backends in 3. above to concurrently wait/check.  After
quite a lot of testing I think I now have this correct.  ("I just found the
*last* bug!" :-)

John
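Using the names from the message, step 2 above might look roughly like the
sketch below; the structure and the wait/return handling are heavily
simplified (the real design keeps this in shared memory and lets only the
last waiter do the housekeeping).

    #include <aio.h>
    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Simplified sketch of the shared control block and completion check
     * described in steps 1-4.  Buffer-descriptor bookkeeping is omitted. */
    typedef struct pg_buf_aiocb
    {
        struct aiocb aiocb;     /* librt control block for the prefetch read */
        /* ... buffer tag, pin bookkeeping, waiter count, etc. ... */
    } pg_buf_aiocb;

    static bool
    check_completion(pg_buf_aiocb *pgcb, bool wait)
    {
        const struct aiocb *list[1] = { &pgcb->aiocb };

        while (aio_error(&pgcb->aiocb) == EINPROGRESS)
        {
            if (!wait)
                return false;                   /* still in flight */
            (void) aio_suspend(list, 1, NULL);  /* block on the librt thread */
        }

        /* in the real design only the last waiter harvests the result */
        return aio_return(&pgcb->aiocb) >= 0;
    }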
On Fri, Nov 2, 2012 at 09:59:08AM -0400, John Lumby wrote:
> Thanks for mentioning this posting.  Interesting.
> However, the OP describes an implementation based on libaio.
> Today what we have (for linux) is librt, which is quite different.
> It is arguably worse than libaio (well, actually I am sure it is worse)
> since it is essentially just an encapsulation of using threads to do
> synchronous ios - you can look at it as making it easier to do what the
> application could do itself if it set up its own pthreads.  The linux
> kernel does not know about it, and so the CPU overhead of checking for
> completion is higher.

Well, good thing we didn't switch to using libaio, now that it is gone.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Bruce Momjian wrote:
>
> On Fri, Nov 2, 2012 at 09:59:08AM -0400, John Lumby wrote:
> > However, the OP describes an implementation based on libaio.
> > Today what we have (for linux) is librt, which is quite different.
>
> Well, good thing we didn't switch to using libaio, now that it is gone.
>

Yes, I think you are correct.  Although I should correct myself about the
status of libaio - it seems many distros continue to provide it, and at
least one other popular database (MySQL) uses it, but as far as I can tell
the content has not been updated by the original authors for around 10
years.  That is perhaps not surprising, since it does very little other
than wrap the linux kernel syscalls.

Set against the CPU-overhead disadvantage of librt, I think the three main
advantages of librt vs libaio/kernel-aio for postgresql are :

 .  posix standard, and probably easier to provide a very similar
    implementation on windows (I see at least one posix aio lib for windows)
 .  no restrictions on the way files are accessed (kernel-aio imposes
    restrictions on open() flags and buffer alignment etc)
 .  it seems (from the recent postings about the earlier attempt to
    implement async io using libaio) that the posix threads style lends
    itself better to fitting in with the postgresql backend model.

John