Thread: Re: [PATCH] Prefetch index pages for B-Tree index scans

Re: [PATCH] Prefetch index pages for B-Tree index scans

From
John Lumby
Date:
> From: Claudio Freire <klaussfreire@gmail.com>
> I hope I'm not talking to myself.

Indeed not.   I also looked into prefetching for pure index scans for b-trees  (and extension to use async io).
http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl

I am not somewhere with a proper setup this week, but I will reply at greater length next week.

John

Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Claudio Freire
Date:
On Tue, Oct 23, 2012 at 9:44 AM, John Lumby <johnlumby@hotmail.com> wrote:
>> From: Claudio Freire <klaussfreire@gmail.com>
>> I hope I'm not talking to myself.
>
> Indeed not.   I also looked into prefetching for pure index scans for
> b-trees  (and extension to use async io).
> http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl

Yes, I've seen that, though I thought it was only an improvement on
PrefetchBuffer. That patch would interact quite nicely with mine.

I'm now trying to prefetch heap tuples, and I got to a really nice
place where I get an extra 30% speedup even on forward scans, but the
patch is still rather green for a review.

> I am not where I have a proper setup this week but will reply at greater
> length next week.

Great - will go on improving the patch in the meanwhile ;-)



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Claudio Freire
Date:
On Tue, Oct 23, 2012 at 10:54 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Indeed not.   I also looked into prefetching for pure index scans for
>> b-trees  (and extension to use async io).
>> http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl
>
> Yes, I've seen that, though I thought it was only an improvement on
> PrefetchBuffer. That patch would interact quite nicely with mine.
>
> I'm now trying to prefetch heap tuples, and I got to a really nice
> place where I get an extra 30% speedup even on forward scans, but the
> patch is rather green now for a review.
>
>> I am not where I have a proper setup this week but will reply at greater
>> length next week.
>
> Great - will go on improving the patch in the meanwhile ;-)

Ok, this is the best I could come up with, without some real test hardware.

The only improvement I see in single-disk scenarios:
  * Huge speedup of back-sequential index-only scans
  * Marginal speedup on forward index-only scans (5% or less)
  * No discernible difference in heap-including scans (even with heap
    prefetch), but I'm pretty sure a real RAID setup would change this
  * No change in pgbench (so I guess no regression for small transactions)

If I manage to get my hands on test hardware, I'll post results. But
we just had to bring some machines offline in our testing datacenter,
which effectively shrank my options, rather than expanding them. I
don't see that improving soon, so I'll post the patch and hope someone
else tests.

PS: Should I add it to the commit fest? Should we compare notes with
John Lumby's patch first?



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Claudio Freire
Date:
On Mon, Oct 29, 2012 at 12:53 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Yes, I've seen that, though I thought it was only an improvement on
>> PrefetchBuffer. That patch would interact quite nicely with mine.
>>
>> I'm now trying to prefetch heap tuples, and I got to a really nice
>> place where I get an extra 30% speedup even on forward scans, but the
>> patch is rather green now for a review.
>>
>>> I am not where I have a proper setup this week but will reply at greater
>>> length next week.
>>
>> Great - will go on improving the patch in the meanwhile ;-)
>
> Ok, this is the best I could come up with, without some real test hardware.

Oops - forgot to effectively attach the patch.

Attachment

Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Cédric Villemain
Date:
> Ok, this is the best I could come up with, without some real test hardware.
>
> The only improvement I see in single-disk scenarios:
>   * Huge speedup of back-sequential index-only scans
>   * Marginal speedup on forward index-only scans (5% or less)
>   * No discernible difference in heap-including scans (even with heap
> prefetch), but I'm pretty sure a real RAID setup would change this
>   * No change in pgbench (so I guess no regression for small transactions)

If the gain is visible mostly for backward scans and not for other access
patterns, I suggest checking the work done on backward prefetching in Linux.

http://thread.gmane.org/gmane.linux.kernel.mm/73837 for example

I don't know how others (BSD, windows, ...) handle this case.


Maybe the strategy of using our own prefetch is better; in that case I would
also like to use it in places where we used to resort to hacks to make Linux
understand that we will benefit from prefetching.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Claudio Freire
Date:
On Mon, Oct 29, 2012 at 4:17 PM, Cédric Villemain
<cedric@2ndquadrant.com> wrote:
>> Ok, this is the best I could come up with, without some real test hardware.
>>
>> The only improvement I see in single-disk scenarios:
>>   * Huge speedup of back-sequential index-only scans
>>   * Marginal speedup on forward index-only scans (5% or less)
>>   * No discernible difference in heap-including scans (even with heap
>> prefetch), but I'm pretty sure a real RAID setup would change this
>>   * No change in pgbench (so I guess no regression for small transactions)
>
> If the gain is visible mostly for the backward and not for other access
> patterns I suggest to check the work done in backward-prefecthing in linux.
>
> http://thread.gmane.org/gmane.linux.kernel.mm/73837 for example

That patch seems very similar (but in kernel terms) to this patch, so
I imagine it would also do the job.

But it also looks forgotten. Bringing it back to life would mean
building the latest kernel with that patch included, replicating the
benchmarks I ran here, sans pg patch, but with patched kernel, and
reporting the (hopefully equally dramatic) performance improvements in
the kernel ML. That would take me quite some time (not used to playing
with kernels, though it wouldn't be my first time either), though it
might be worth the effort.

> I don't know how others (BSD, windows, ...) handle this case.

I don't even think windows supports posix_fadvise, but if async_io is
used (as hinted by the link Lumby posted), it would probably also work
in windows.

BSD probably supports it the same way linux does.

> Maybe the strategy to use our own prefetch is better, then I would like to use
> it also in places where we used to hack to make linux understand that we will
> benefits from prefetching.

It would at least benefit those installations without the
latest-in-the-future kernel-with-backwards-readahead.

Which places are you referring to, btw?



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Cédric Villemain
Date:
> But it also looks forgotten. Bringing it back to life would mean
> building the latest kernel with that patch included, replicating the
> benchmarks I ran here, sans pg patch, but with patched kernel, and
> reporting the (hopefully equally dramatic) performance improvements in
> the kernel ML. That would take me quite some time (not used to playing
> with kernels, though it wouldn't be my first time either), though it
> might be worth the effort.

Well, informing linux hackers may help.

> > I don't know how others (BSD, windows, ...) handle this case.
>
> I don't even think windows supports posix_fadvise, but if async_io is
> used (as hinted by the link Lumby posted), it would probably also work
> in windows.
>
> BSD probably supports it the same way linux does.

I meant it the other way around: how do other kernels handle backward
prefetch?

>
> > Maybe the strategy to use our own prefetch is better, then I would like
> > to use it also in places where we used to hack to make linux understand
> > that we will benefits from prefetching.
>
> It would at least benefit those installations without the
> latest-in-the-future kernel-with-backwards-readahead.

We're talking about PostgreSQL 9.3: running cutting-edge PostgreSQL on an old
kernel at the end of 2013... Maybe it won't be so latest-in-the-future by then.

Btw, the improvements you are making look good; I am just adding some
information about what is being done around us.

>
> To which places are you referring to, btw?

Maintenance tasks.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Claudio Freire
Date:
On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
<cedric@2ndquadrant.com> wrote:
>> But it also looks forgotten. Bringing it back to life would mean
>> building the latest kernel with that patch included, replicating the
>> benchmarks I ran here, sans pg patch, but with patched kernel, and
>> reporting the (hopefully equally dramatic) performance improvements in
>> the kernel ML. That would take me quite some time (not used to playing
>> with kernels, though it wouldn't be my first time either), though it
>> might be worth the effort.
>
> Well, informing linux hackers may help.

I agree. I'm a bit hesitant to subscribe to yet another mailing list,
but I happen to agree.

>> > I don't know how others (BSD, windows, ...) handle this case.
>>
>> I don't even think windows supports posix_fadvise, but if async_io is
>> used (as hinted by the link Lumby posted), it would probably also work
>> in windows.
>>
>> BSD probably supports it the same way linux does.
>
> I though of the opposite way: how do other kernels handle the backwards
> prefetch.

From what I saw (while researching that statement above), BSD's
read-ahead and fadvise implementations are way behind linux's.
Functional, though. I haven't been able to find the code responsible
for readahead in FreeBSD yet to confirm whether they have anything
supporting back-sequential patterns.

>> > Maybe the strategy to use our own prefetch is better, then I would like
>> > to use it also in places where we used to hack to make linux understand
>> > that we will benefits from prefetching.
>>
>> It would at least benefit those installations without the
>> latest-in-the-future kernel-with-backwards-readahead.
>
> We're speaking of PostgreSQL 9.3, running cutting edge PostgreSQL and old
> kernel in end 2013... Maybe it won't be so latest-in-the-future at this time.

Good point.



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
John Lumby
Date:
Claudio wrote :
>
> Oops - forgot to effectively attach the patch.
>

I've read through your patch and the earlier posts by you and Cédric.

This is very interesting.      You chose to prefetch index btree (key-ptr) pages
whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
Never mind why  --  I think they should work very well together  -  as both have
been demonstrated to produce improvements.   I will see if I can combine them,
git permitting  (as of course their changed file lists overlap).

I was surprised by this design decision :
    /* start prefetch on next page, but not if we're reading sequentially
     * already, as it's counterproductive in those cases */
Is it really?    Are you assuming that it's redundant with posix_fadvise for this case?
I think possibly when async_io is also in use by the postgresql prefetcher,
this decision could change.

Cédric wrote:
> If the gain is visible mostly for the backward and not for other access
>
> building the latest kernel with that patch included, replicating the
>

I found improvement in forward scans.
Actually I did not even try backward but only because I did not have time.
It should help both.

>> I don't even think windows supports posix_fadvise, but if async_io is
>> used (as hinted by the link Lumby posted), it would probably also work
>> in windows.

windows has async io and I think it would not be hard to extend my implementation
to windows  (although I don't plan it myself).     Actually about 95% of the code I wrote
to implement async-io in postgresql concerns not the async io,  which is trivial,
but the buffer management.   With async io,   PrefetchBuffer must allocate and pin a
buffer,  (not too hard),   but now also every other part of buf mgr must know about the
possibility that a buffer may be in async_io_in_progress state and be prepared to
determine the possible completion (quite hard)   -  and also if and when the prefetch requester
comes again with ReadBuffer,  buf mgr has to remember that this buffer was pinned by
this backend during previous prefetch and must not be re-pinned a second time
(very hard without increasing size of the shared descriptor,  which was important
since there could be a very large number of these).
It ended up with a major change to bufmgr.c plus one new file for handling
buffer management aspects of starting, checking and terminating async io.

However I think in some environments the async-io has significant benefits over
posix-fadvise,  especially (of course!)   where access is very non-sequential,
but even also for sequential if there are many concurrent conflicting sets of sequential
command streams from different backends
(always assuming the RAID can manage them concurrently).

I've attached a snapshot patch of just the non-bitmap-index-scan changes I've made.
You can't compile it as is because I had to change the interface to PrefetchBuffer
and add a new DiscardBuffer which I did not include in this snapshot to avoid confusion.

John


Attachment

Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Marti Raudsepp
Date:
On Tue, Oct 30, 2012 at 1:50 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
> <cedric@2ndquadrant.com> wrote:
>> Well, informing linux hackers may help.
>
> I agree. I'm a bit hesitant to subscribe to yet another mailing list

FYI you can send messages to linux-kernel without subscribing (there's
no moderation either).

Subscribing to linux-kernel is like drinking from a firehose :)

Regards,
Marti



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Andres Freund
Date:
On Thursday, November 01, 2012 05:53:39 PM Marti Raudsepp wrote:
> On Tue, Oct 30, 2012 at 1:50 AM, Claudio Freire <klaussfreire@gmail.com>
wrote:
> > On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
> >
> > <cedric@2ndquadrant.com> wrote:
> >> Well, informing linux hackers may help.
> >
> > I agree. I'm a bit hesitant to subscribe to yet another mailing list
>
> FYI you can send messages to linux-kernel without subscribing (there's
> no moderation either).
>
> Subscribing to linux-kernel is like drinking from a firehose :)

linux-fsdevel is more reasonable though...

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Claudio Freire
Date:
On Thu, Nov 1, 2012 at 1:37 PM, John Lumby <johnlumby@hotmail.com> wrote:
>
> Claudio wrote :
>>
>> Oops - forgot to effectively attach the patch.
>>
>
> I've read through your patch and the earlier posts by you and Cédric.
>
> This is very interesting.      You chose to prefetch index btree (key-ptr) pages
> whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
> Never mind why  --  I think they should work very well together  -  as both have
> been demonstrated to produce improvements.   I will see if I can combine them,
> git permitting  (as of course their changed file lists overlap).

Check the latest patch, it contains heap page prefetching too.

> I was surprised by this design decision :
>     /* start prefetch on next page, but not if we're reading sequentially
>      * already, as it's counterproductive in those cases */
> Is it really?    Are you assuming the it's redundant with posix_fadvise for this case?
> I think possibly when async_io is also in use by the postgresql prefetcher,
> this decision could change.

async_io may indeed make that logic obsolete, but the trouble there is
not that posix_fadvise is redundant; it is that the kernel stops doing
read-ahead when a call to posix_fadvise comes in. I noticed the
performance hit, and checked the kernel's code. It effectively changes
the prediction mode from sequential to fadvise, negating the kernel's
(assumed) prefetch logic.
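
To make the heuristic concrete, here is a minimal sketch of skipping the
explicit hint when the scan is already forward-sequential, so that the
kernel's own read-ahead is left alone. It is not code from the patch:
ScanPrefetchState, is_forward_sequential and maybe_prefetch are hypothetical
names, and the 8 kB block size is just PostgreSQL's default. The only real
API used is posix_fadvise with POSIX_FADV_WILLNEED.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>

#define BLOCK_SZ 8192           /* PostgreSQL's default block size */

/* Hypothetical per-scan state; not a structure from the patch. */
typedef struct ScanPrefetchState
{
    long last_block;            /* block read most recently, or -1 if none */
} ScanPrefetchState;

/* True when next_block directly follows the block read last, i.e. the
 * pattern that kernel read-ahead is expected to detect by itself. */
static bool
is_forward_sequential(const ScanPrefetchState *st, long next_block)
{
    return st->last_block >= 0 && next_block == st->last_block + 1;
}

/* Hint the kernel about next_block unless the scan is forward-sequential.
 * Returns 1 if a hint was issued, 0 if it was skipped, -1 on error. */
static int
maybe_prefetch(int fd, const ScanPrefetchState *st, long next_block)
{
    assert(next_block >= 0);

    if (is_forward_sequential(st, next_block))
        return 0;               /* leave it to kernel read-ahead */

    /* posix_fadvise returns an error number directly, not -1/errno. */
    int rc = posix_fadvise(fd, (off_t) next_block * BLOCK_SZ, BLOCK_SZ,
                           POSIX_FADV_WILLNEED);
    return rc == 0 ? 1 : -1;
}
```

With last_block = 7, a request for block 8 is skipped, while a request for
block 6 (a back-sequential step) gets an explicit hint.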

> However I think in some environments the async-io has significant benefits over
> posix-fadvise,  especially (of course!)   where access is very non-sequential,
> but even also for sequential if there are many concurrent conflicting sets of sequential
> command streams from different backends
> (always assuming the RAID can manage them concurrently).

I've mused about the possibility of batching async_io requests, and using
the scatter/gather API instead of sending tons of requests to the
kernel. I think doing so would enable a zero-copy path that could very
possibly yield big speed improvements when memory bandwidth is the
bottleneck.



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Claudio Freire
Date:
On Thu, Nov 1, 2012 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > I agree. I'm a bit hesitant to subscribe to yet another mailing list
>>
>> FYI you can send messages to linux-kernel without subscribing (there's
>> no moderation either).
>>
>> Subscribing to linux-kernel is like drinking from a firehose :)
>
> linux-fsdevel is more reasonable though...

Readahead logic is not at the filesystem level, but at the memory mapper's.

I'll consider posting without subscribing.



FW: [PATCH] Prefetch index pages for B-Tree index scans

From
John Lumby
Date:
Claudio wrote :
>
> Check the latest patch, it contains heap page prefetching too.
>

Oh yes I see. I missed that - I was looking in the wrong place.
I do have one question about the way you did it : by placing the
prefetch heap-page calls in _bt_next, which effectively means inside
a call from the index am index_getnext_tid to btgettuple, are you sure
you are synchronizing your prefetches of heap pages with the index am's
ReadBuffer's of heap pages? I.e. are you complying with this comment
from nodeBitmapHeapscan.c for prefetching its bitmap heap pages in
the bitmap-index-scan case:

* We issue prefetch requests *after* fetching the current page to try
* to avoid having prefetching interfere with the main I/O.

I can't really tell whether your design conforms to this, nor do I
know whether it is important, but I decided to do it in the same manner,
and so implemented the heap-page fetching in index_fetch_heap.
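
The ordering described in that comment can be sketched as follows: read the
current block first, and only then hint the kernel about the next one. This
is an illustration, not code from either patch; read_then_prefetch and the
order array are hypothetical, and a real scan would take the block numbers
from the index rather than from a precomputed list.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SZ 8192           /* PostgreSQL's default block size */

/* Read the blocks listed in order[0..n-1] into page, one at a time.
 * The hint for the following block is issued only after the current
 * block has been read, so the hint does not compete with the main I/O.
 * Returns the number of blocks read, or -1 on a failed or short read. */
static long
read_then_prefetch(int fd, const long *order, long n, char *page)
{
    assert(order != NULL && page != NULL && n >= 0);

    for (long i = 0; i < n; i++)
    {
        /* Main I/O for the current block comes first. */
        if (pread(fd, page, BLOCK_SZ, (off_t) order[i] * BLOCK_SZ) != BLOCK_SZ)
        {
            fprintf(stderr, "failed or short read of block %ld\n", order[i]);
            return -1;
        }

        /* Advisory hint for the next block; its result is ignored. */
        if (i + 1 < n)
            (void) posix_fadvise(fd, (off_t) order[i + 1] * BLOCK_SZ,
                                 BLOCK_SZ, POSIX_FADV_WILLNEED);

        /* ... the caller would process the page here ... */
    }
    return n;
}
```

Passing a descending list such as {2, 1, 0} gives the back-sequential
pattern discussed in this thread.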

>
> async_io indeed may make that logic obsolete, but it's not redundant
> posix_fadvise what's the trouble there, but the fact that the kernel
> stops doing read-ahead when a call to posix_fadvise comes. I noticed
> the performance hit, and checked the kernel's code. It effectively
> changes the prediction mode from sequential to fadvise, negating the
> (assumed) kernel's prefetch logic.
>
I did not know that. Very interesting.


>
> I've mused about the possibility to batch async_io requests, and use
> the scatter/gather API insead of sending tons of requests to the
> kernel. I think doing so would enable a zero-copy path that could very
> possibly imply big speed improvements when memory bandwidth is the
> bottleneck.

I think you are totally correct on this point. If I recall, the
glibc (librt) aio does have an lio_listio, but it is either a no-op
or just loops over the list, I forget which (I don't have its source right now),
but in any case I am sure there is potential for implementing such a facility.
But to be really effective, it should be implemented in the kernel itself,
which we don't have today.
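
For reference, the POSIX interface for submitting several requests in one
call looks roughly like the sketch below. It only shows the shape of the
lio_listio API; since the library may simply loop over the list, no speedup
should be assumed from it. read_blocks_batched, MAX_BATCH and the 8 kB block
size are assumptions made for the illustration, and error handling is
deliberately simplified.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define BLOCK_SZ  8192          /* PostgreSQL's default block size */
#define MAX_BATCH 16            /* arbitrary limit for this sketch */

/* Submit reads for count blocks of fd as one batch and wait for all of
 * them. bufs[i] receives block blocks[i]. Returns 0 if every block was
 * read completely, -1 otherwise. */
static int
read_blocks_batched(int fd, const long *blocks,
                    char (*bufs)[BLOCK_SZ], int count)
{
    struct aiocb cbs[MAX_BATCH];
    struct aiocb *list[MAX_BATCH];
    int result = 0;

    assert(count > 0 && count <= MAX_BATCH);

    for (int i = 0; i < count; i++)
    {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = bufs[i];
        cbs[i].aio_nbytes = BLOCK_SZ;
        cbs[i].aio_offset = (off_t) blocks[i] * BLOCK_SZ;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }

    /* LIO_WAIT blocks until every request in the list has completed.
     * Simplification: a real caller would also handle EINTR, where some
     * requests may still be in flight when the call returns. */
    if (lio_listio(LIO_WAIT, list, count, NULL) != 0)
    {
        perror("lio_listio");
        return -1;
    }

    /* Collect each result exactly once; a short read counts as failure. */
    for (int i = 0; i < count; i++)
    {
        int err = aio_error(&cbs[i]);
        ssize_t n = aio_return(&cbs[i]);

        if (err != 0 || n != BLOCK_SZ)
            result = -1;
    }
    return result;
}
```

On older glibc versions this needs to be linked with -lrt.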

John


Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Greg Smith
Date:
On 11/1/12 6:13 PM, Claudio Freire wrote:

> posix_fadvise what's the trouble there, but the fact that the kernel
> stops doing read-ahead when a call to posix_fadvise comes. I noticed
> the performance hit, and checked the kernel's code. It effectively
> changes the prediction mode from sequential to fadvise, negating the
> (assumed) kernel's prefetch logic.

That's really interesting.  There was a patch submitted at one point to 
use POSIX_FADV_SEQUENTIAL on sequential scans, and that wasn't a 
repeatable improvement either, so it was canned at
http://archives.postgresql.org/pgsql-hackers/2008-10/msg01611.php

The Linux posix_fadvise implementation never seemed like it was well
liked by the kernel developers.  Quirky stuff like this popped up all
the time during that period, when effective_io_concurrency was being
added.  I wonder how far back the fadvise/read-ahead conflict goes.

> I've mused about the possibility to batch async_io requests, and use
> the scatter/gather API instead of sending tons of requests to the
> kernel. I think doing so would enable a zero-copy path that could very
> possibly imply big speed improvements when memory bandwidth is the
> bottleneck.

Another possibly useful bit of history here for you.  Greg Stark wrote a
test program that used async I/O effectively on both Linux and Solaris.
Unfortunately, it was hard to get that to work given how Postgres does
its buffer I/O, and using processes instead of threads.  This looks like
the place he commented on why:

http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html

The part I think was relevant there from him:

"In the libaio view of the world you initiate io and either get a
callback or call another syscall to test if it's complete. Either
approach has problems for Postgres. If the process that initiated io
is in the middle of a long query it might take a long time, or not even 
never get back to complete the io. The callbacks use threads...

And polling for completion has the problem that another process could
be waiting on the io and can't issue a read as long as the first
process has the buffer locked and io in progress. I think aio makes a
lot more sense if you're using threads so you can start a thread to
wait for the io to complete."

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Claudio Freire
Date:
On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 11/1/12 6:13 PM, Claudio Freire wrote:
>
>> posix_fadvise what's the trouble there, but the fact that the kernel
>> stops doing read-ahead when a call to posix_fadvise comes. I noticed
>> the performance hit, and checked the kernel's code. It effectively
>> changes the prediction mode from sequential to fadvise, negating the
>> (assumed) kernel's prefetch logic.
>
>
...
>
> The Linux posix_fadvise implementation never seemed like it was well liked
> by the kernel developers.  Quirky stuff like this popped up all the time
> during that period, when effective_io_concurrency was being added.  I wonder
> how far back the fadvise/read-ahead conflict goes back.

Well, to be precise, it's not so much a problem in posix_fadvise
itself as a problem in how it interacts with readahead. Since
readahead works at the memory mapper level, and only when actually
performing I/O (which would seem at first glance quite sensible), it
doesn't get to see fadvise activity.

FADV_WILLNEED is implemented as a forced readahead, which doesn't
update any of the readahead context structures. Again, at first
glance, this would seem sensible (explicit hints shouldn't interfere
with pattern detection logic). However, since those pages are (after
the fadvise call) under async I/O, next time the memory mapper needs
that page, instead of requesting I/O through readahead logic, it will
wait for async I/O to complete.

IOW, what was in fact sequential became invisible to readahead,
indistinguishable from random I/O. Whatever page fadvise failed to
predict will be treated as random I/O, and here the trouble lies.

>> I've mused about the possibility to batch async_io requests, and use
>> the scatter/gather API instead of sending tons of requests to the
>
>> kernel. I think doing so would enable a zero-copy path that could very
>> possibly imply big speed improvements when memory bandwidth is the
>> bottleneck.
>
> Another possibly useful bit of history here for you.  Greg Stark wrote a
> test program that used async I/O effectively on both Linux and Solaris.
> Unfortunately, it was hard to get that to work given how Postgres does its
> buffer I/O, and using processes instead of threads.  This looks like the
> place he commented on why:
>
> http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html
>
> The part I think was relevant there from him:
>
> "In the libaio view of the world you initiate io and either get a
> callback or call another syscall to test if it's complete. Either
> approach has problems for Postgres. If the process that initiated io
> is in the middle of a long query it might take a long time, or not even
> never get back to complete the io. The callbacks use threads...
>
> And polling for completion has the problem that another process could
> be waiting on the io and can't issue a read as long as the first
> process has the buffer locked and io in progress. I think aio makes a
> lot more sense if you're using threads so you can start a thread to
> wait for the io to complete."

I noticed that. I always envisioned async I/O as managed by some
dedicated process. One that could check for completion or receive
callbacks. Postmaster, for instance.



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
John Lumby
Date:
Claudio Freire wrote:
>
> On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> > On 11/1/12 6:13 PM, Claudio Freire wrote:
> >
> >> posix_fadvise what's the trouble there, but the fact that the kernel
> >> stops doing read-ahead when a call to posix_fadvise comes. I noticed
> >> the performance hit, and checked the kernel's code. It effectively
> >> changes the prediction mode from sequential to fadvise, negating the
> >> (assumed) kernel's prefetch logic.
> >
> > The Linux posix_fadvise implementation never seemed like it was well liked
> > by the kernel developers. Quirky stuff like this popped up all the time
> > during that period, when effective_io_concurrency was being added. I wonder
> > how far back the fadvise/read-ahead conflict goes back.
>
> Well, to be precise it's not so much as a problem in posix_fadvise
> itself, it's a problem in how it interacts with readahead. Since
> readahead works at the memory mapper level, and only when actually
> performing I/O (which would seem at first glance quite sensible), it
> doesn't get to see fadvise activity.
>
> FADV_WILLNEED is implemented as a forced readahead, which doesn't
> update any of the readahead context structures. Again, at first
> glance, this would seem sensible (explicit hints shouldn't interfere
> with pattern detection logic). However, since those pages are (after
> the fadvise call) under async I/O, next time the memory mapper needs
> that page, instead of requesting I/O through readahead logic, it will
> wait for async I/O to complete.
>
> IOW, what was sequential in fact, became invisible to readahead,
> indistinguishable from random I/O. Whatever page fadvise failed to
> predict will be treated as random I/O, and here the trouble lies.

And this may be one other advantage of async io over posix_fadvise in
the linux environment (with the present mmap behaviour):
async io achieves the same objective of improving disk/processing overlap
without the mentioned interference with read-ahead.
Although confirming this would ideally require a 3-way comparison of
   posix-fadvise + existing readahead behaviour
   posix-fadvise + readahead behaviour modified
                              to not force waiting for current async io
                              (i.e. just check the aio and continue normal readahead if in progress)
   async io with no posix-fadvise

It seems in general to be preferable to have an implementation that is
less dependent on specific behaviour of the OS read-ahead mechanism.

>
> >> I've mused about the possibility to batch async_io requests, and use
> >> the scatter/gather API instead of sending tons of requests to the
> >
> >> kernel. I think doing so would enable a zero-copy path that could very
> >> possibly imply big speed improvements when memory bandwidth is the
> >> bottleneck.
> >
> > Another possibly useful bit of history here for you. Greg Stark wrote a
> > test program that used async I/O effectively on both Linux and Solaris.
> > Unfortunately, it was hard to get that to work given how Postgres does its
> > buffer I/O, and using processes instead of threads. This looks like the
> > place he commented on why:
> >
> > http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html
> >
> > The part I think was relevant there from him:
> >
> > "In the libaio view of the world you initiate io and either get a
> > callback or call another syscall to test if it's complete. Either
> > approach has problems for Postgres. If the process that initiated io
> > is in the middle of a long query it might take a long time, or not even
> > never get back to complete the io. The callbacks use threads...
> >
> > And polling for completion has the problem that another process could
> > be waiting on the io and can't issue a read as long as the first
> > process has the buffer locked and io in progress. I think aio makes a
> > lot more sense if you're using threads so you can start a thread to
> > wait for the io to complete."
>
> I noticed that. I always envisioned async I/O as managed by some
> dedicated process. One that could check for completion or receive
> callbacks. Postmaster, for instance.

Thanks for mentioning this posting.    Interesting.
However,    the OP describes an implementation based on libaio.
Today what we have (for linux) is librt,  which is quite different.
It is arguably worse than libaio (well actually I am sure it is worse)
since it is essentially just an encapsulation of using threads to do
synchronous ios  -  you can look at it as making it easier to do what the
application could do itself if it set up its own pthreads.     The linux
kernel does not know about it and so the CPU overhead of checking for
completion is higher.
But if async io is used *only* for prefetching, and not for the actual
ReadBuffer itself   (which is what I did),   then the problem
mentioned by the OP
     "If the process that initiated io is in the middle of a long query
        it might take a long time, or never get back to complete the io"
is not a problem.     The model is simple:

1.    backend process P1 requests prefetch on a relation/block R/B,
      which results in initiating aio_read using
      a (new) shared control block (call it pg_buf_aiocb)
      which tracks the request and also contains the librt's aiocb.
2.    any backend P2 (which may or may not be P1)
      issues ReadBuffer or similar,
      requesting access for read/pin to the buffer containing R/B.
      This backend discovers that the buffer is aio_in_progress
      and calls check_completion(pg_buf_aiocb),
      and waits (effectively on the librt thread) if not complete.
3.    any number of other backends may concurrently do the same as 2.
      I.e. if pg_buf_aiocb is still aio-in-progress, they all also wait
      on the librt thread.
4.    each waiting backend receives the completion
      and the last one does the housekeeping and returns the pg_buf_aiocb.

What complicates it is managing the associated pinned buffer
in such a way that every backend takes the correct action
with the correct degree of serialization of the buffer descriptor
during critical sections, but yet allowing all backends in
3. above to concurrently wait/check.      After quite a lot of testing
I think I now have this correct.   ("I just found the *last* bug!" :-)
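
Within a single process, the start/check/complete sequence of the model
above maps onto the POSIX aio calls roughly as in the sketch below.
PrefetchCtl is a hypothetical stand-in for the pg_buf_aiocb described above,
and start_prefetch and complete_prefetch are made-up names. The sketch does
not attempt the hard part, which is sharing the control block and the pinned
buffer safely between several backends.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define BLOCK_SZ 8192           /* PostgreSQL's default block size */

/* Hypothetical stand-in for the pg_buf_aiocb described above. */
typedef struct PrefetchCtl
{
    struct aiocb cb;            /* the librt control block */
    bool in_progress;           /* true between start and completion */
    char page[BLOCK_SZ];        /* destination buffer for the block */
} PrefetchCtl;

/* Step 1: start an asynchronous read of one block.
 * Returns 0 on success, -1 if the request could not be queued. */
static int
start_prefetch(PrefetchCtl *ctl, int fd, long block)
{
    assert(!ctl->in_progress);

    memset(&ctl->cb, 0, sizeof(ctl->cb));
    ctl->cb.aio_fildes = fd;
    ctl->cb.aio_buf = ctl->page;
    ctl->cb.aio_nbytes = BLOCK_SZ;
    ctl->cb.aio_offset = (off_t) block * BLOCK_SZ;

    if (aio_read(&ctl->cb) != 0)
    {
        perror("aio_read");
        return -1;
    }
    ctl->in_progress = true;
    return 0;
}

/* Steps 2 to 4: wait if the read is still in flight, then collect the
 * result. Returns the number of bytes read, or -1 on I/O error. */
static ssize_t
complete_prefetch(PrefetchCtl *ctl)
{
    const struct aiocb *list[1] = { &ctl->cb };
    int err;

    assert(ctl->in_progress);

    /* aio_suspend may return early (e.g. EINTR); the loop rechecks. */
    while ((err = aio_error(&ctl->cb)) == EINPROGRESS)
        (void) aio_suspend(list, 1, NULL);

    ctl->in_progress = false;

    /* aio_return must be called exactly once per completed request. */
    ssize_t n = aio_return(&ctl->cb);

    return err != 0 ? -1 : n;
}
```

On older glibc versions this needs to be linked with -lrt.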


John




Re: [PATCH] Prefetch index pages for B-Tree index scans

From
Bruce Momjian
Date:
On Fri, Nov  2, 2012 at 09:59:08AM -0400, John Lumby wrote:
> Thanks for the mentioning this posting.    Interesting.
> However,    the OP describes an implementation based on libaio. 
> Today what we have (for linux) is librt,  which is quite different.
> It is arguable worse than libaio (well actually I am sure it is worse)
> since it is essentially just an encapsulation of using threads to do
> synchronous ios  -  you can look at it as making it easier to do what the 
> application could do itself if it set up its own pthreads.     The linux
> kernel does not know about it and so the CPU overhead of checking for
> completion is higher.

Well, good thing we didn't switch to using libaio, now that it is gone.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +



Re: [PATCH] Prefetch index pages for B-Tree index scans

From
John Lumby
Date:
Bruce Momjian wrote:
>
> On Fri, Nov 2, 2012 at 09:59:08AM -0400, John Lumby wrote:
> > However,    the OP describes an implementation based on libaio.
> > Today what we have (for linux) is librt,  which is quite different.
>
> Well, good thing we didn't switch to using libaio, now that it is gone.
>
Yes,  I think you are correct.    Although I should correct myself about
status of libaio -  it seems many distros continue to provide it and at least
one other popular database (MySQL) uses it,   but as far as I can tell
the content has not been updated by the original authors for around 10 years.
That is perhaps not surprising since it does very little other than wrap
the linux kernel syscalls.

Set against the CPU-overhead disadvantage of librt,  I think the three
main advantages of librt vs libaio/kernel-aio for postgresql are :
  .   posix standard,  and probably easier to provide very similar
      implementation on windows  (I see at least one posix aio lib for windows)
  .   no restrictions on the way files are accessed  (kernel-aio imposes restrictions
      on open() flags and buffer alignment etc)
  .   it seems (from the recent postings about the earlier attempt to implement
      async io using libaio) that the posix threads style lends itself better to
      fitting in with the postgresql backend model.
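
As an illustration of the second point: kernel aio on Linux is typically
used with files opened with O_DIRECT, which expects the buffer address, file
offset and transfer length to be suitably aligned. The sketch below shows
what a caller has to arrange in that case. The 4096-byte alignment is an
assumption (the real requirement depends on the device and filesystem), and
alloc_io_buffer and request_is_aligned are hypothetical helpers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

#define BLOCK_SZ 8192           /* PostgreSQL's default block size */
#define IO_ALIGN 4096           /* assumed alignment requirement */

static_assert(BLOCK_SZ % IO_ALIGN == 0,
              "block size must be a multiple of the I/O alignment");

/* Allocate one block-sized buffer aligned to IO_ALIGN.
 * Returns NULL on failure; release the buffer with free(). */
static void *
alloc_io_buffer(void)
{
    void *p = NULL;

    if (posix_memalign(&p, IO_ALIGN, BLOCK_SZ) != 0)
        return NULL;
    return p;
}

/* Check that buffer address, file offset and length all meet IO_ALIGN. */
static bool
request_is_aligned(const void *buf, off_t offset, size_t len)
{
    return (uintptr_t) buf % IO_ALIGN == 0
        && offset % IO_ALIGN == 0
        && len % IO_ALIGN == 0;
}
```

With librt none of this is needed, since its reads go through ordinary
buffered I/O.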

John