Thread: Asynchronous I/O Support
Postgres 8.1 doesn't seem to support asynchronous I/O. Has its design been thought of already?

We tried a simple example. For an index nested-loop join: fetch the outer tuples into an array, then send all the corresponding inner-tuple fetch requests asynchronously. While the I/O for the inner relation is in progress, the next outer-tuple array can be populated and other join work can happen. This is the maximum overlap we could think of while making minimal changes. [The current implementation does synchronous I/O: it fetches an outer tuple, requests the corresponding inner tuple (and waits until it arrives), does the processing, gets another inner/outer tuple, and so on.]

We have made appropriate changes in nodeNestloop.c but are unable to track down how it issues the I/O and gets the tuple into the slot. Help!

How does one issue an async I/O (given that kernel 2.6 supports AIO)? And which would be best: a callback scheme, or synchronous I/O on top of AIO? Also, as Graefe's paper suggests, a producer-consumer (thread-based) design is the best way to do this. But how would we implement threading (in case it's possible at all)?

Sincere regards,
Raja Agrawal
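[For concreteness: the batch-and-overlap pattern described above looks roughly like the following in portable POSIX AIO. This is a sketch, not PostgreSQL code; the file name, request count, and offsets are placeholders, and on Linux glibc may emulate these calls with background threads rather than kernel AIO. Compile with -lrt.]

    /* Issue a batch of reads with POSIX AIO, do other work, then reap them. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NREQS  8
    #define BLCKSZ 8192

    int main(void)
    {
        int fd = open("/tmp/inner_relation", O_RDONLY);   /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        static char bufs[NREQS][BLCKSZ];
        struct aiocb cbs[NREQS];
        const struct aiocb *list[NREQS];

        /* Issue all inner-tuple block reads up front. */
        for (int i = 0; i < NREQS; i++)
        {
            memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf    = bufs[i];
            cbs[i].aio_nbytes = BLCKSZ;
            cbs[i].aio_offset = (off_t) i * BLCKSZ;  /* scattered in real use */
            if (aio_read(&cbs[i]) != 0) { perror("aio_read"); return 1; }
            list[i] = &cbs[i];
        }

        /* ... populate the next outer-tuple array here, overlapping the I/O ... */

        /* Reap completions; NULL entries in the list are ignored. */
        int pending = NREQS;
        while (pending > 0)
        {
            if (aio_suspend(list, NREQS, NULL) != 0 && errno != EINTR)
            { perror("aio_suspend"); return 1; }
            for (int i = 0; i < NREQS; i++)
            {
                if (list[i] && aio_error(&cbs[i]) != EINPROGRESS)
                {
                    printf("request %d read %zd bytes\n", i, aio_return(&cbs[i]));
                    list[i] = NULL;
                    pending--;
                }
            }
        }
        close(fd);
        return 0;
    }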
On Sun, Oct 15, 2006 at 04:16:07AM +0530, Raja Agrawal wrote:
> Postgres 8.1 doesn't seem to support asynchronous I/O. Has its design
> been thought of already?

Sure, I even implemented it once. Didn't get any faster. At that point I realised that my kernel didn't actually support async I/O, and the glibc emulation sucks for anything other than network I/O, so I gave up. Maybe one of these days I should work out if my current system supports it, and give it another go...

Have enough systems actually got to the point of actually supporting async I/O that it's worth implementing?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn,

On 10/15/06 10:56 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote:
> Have enough systems actually got to the point of actually supporting
> async I/O that it's worth implementing?

I think there are enough high-end applications / systems that need it at this point.

The killer use-case we've identified is the scattered I/O associated with index + heap scans in Postgres. If we can issue ~5-15 I/Os in advance when the TIDs are widely separated, it has the potential to increase the I/O speed by the number of disks in the tablespace being scanned. At this point, that pattern will only use one disk.

- Luke
On Sun, 2006-10-15 at 19:56 +0200, Martijn van Oosterhout wrote:
> Sure, I even implemented it once. Didn't get any faster.

Did you just do something akin to s/read/aio_read/ etc., or something more ambitious? I think that really taking advantage of the ability to have multiple I/O requests outstanding would take some leg work.

> Maybe one of these days I should work out if my current system supports
> it, and give it another go...

At least according to [1], kernel AIO on Linux still doesn't work for buffered (i.e. non-O_DIRECT) files. There have been patches available for quite some time that implement this, but I'm not sure when they are likely to get into the mainline kernel.

-Neil

[1] http://lse.sourceforge.net/io/aio.html
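[To make the buffered-vs-O_DIRECT distinction concrete: the kernel interface in question is the io_submit() family, usually reached through libaio. A sketch, assuming libaio is installed (link with -laio); without O_DIRECT the submission will typically just execute synchronously. File name and sizes are placeholders.]

    /* Linux kernel AIO via libaio; only truly asynchronous for O_DIRECT
     * files, which in turn require aligned buffers and offsets. */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, BLCKSZ) != 0)  /* O_DIRECT alignment */
            return 1;

        io_context_t ctx = 0;
        if (io_queue_init(32, &ctx) != 0) { perror("io_queue_init"); return 1; }

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, BLCKSZ, 0);       /* block at offset 0 */

        /* Returns immediately for O_DIRECT; may block and do the read
         * synchronously for ordinary buffered files. */
        if (io_submit(ctx, 1, cbs) != 1)
        { fprintf(stderr, "io_submit failed\n"); return 1; }

        /* ... do other work here ... */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);           /* reap the completion */
        printf("read %ld bytes\n", (long) ev.res);

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }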
On Sun, Oct 15, 2006 at 02:26:12PM -0400, Neil Conway wrote:
> Did you just do something akin to s/read/aio_read/ etc., or something
> more ambitious? I think that really taking advantage of the ability to
> have multiple I/O requests outstanding would take some leg work.

Sure. Basically, at certain strategic points in the code there were extra ReadAsyncBuffer() commands (the IndexScan node and the b-tree scan code). This command was allowed to do nothing, but if there were not too many outstanding requests and a buffer was available, it would allocate a buffer and initiate an AIO request for it. IIRC there was a table of outstanding requests (I think I originally allowed up to 32) and when a normal ReadBuffer() found the block had already been requested, it "waited" on that block.

The principle was that the index-scan node would read a page full of tids, submit a ReadAsyncBuffer() on each one, and then proceed as normal. Fairly unintrusive patch all up. ifdeffing it out is safe, and #defining ReadAsyncBuffer() away causes the compiler to optimise the loop away altogether.

The POSIX AIO layer sucks somewhat so it was tricky, but it did work. The hardest part is really how to decide if a buffer currently in the buffer cache is worth more than an asynchronously loaded buffer that may not be used. I posted the results to -hackers some time ago, so you can always try that.

> At least according to [1], kernel AIO on Linux still doesn't work for
> buffered (i.e. non-O_DIRECT) files. There have been patches available
> for quite some time that implement this, but I'm not sure when they are
> likely to get into the mainline kernel.

You can also do it by spawning off threads to do the requests. The glibc emulation uses threads, but only allows one outstanding request per file, which makes it useless for our purposes...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
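[For readers following along, a hypothetical, much-simplified sketch of the outstanding-request table described above. The names (ReadAsyncBuffer, WaitForPrefetched) and the layout are inventions for illustration, not the actual patch, and all real buffer-manager locking is omitted.]

    /* Hypothetical sketch of an outstanding-prefetch table, loosely
     * following the patch description above; not real PostgreSQL code. */
    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    #define MAX_OUTSTANDING 32
    #define BLCKSZ 8192

    typedef struct
    {
        int          in_use;
        int          fd;            /* relation file */
        long         blocknum;      /* block being prefetched */
        char         buf[BLCKSZ];
        struct aiocb cb;
    } AsyncSlot;

    static AsyncSlot slots[MAX_OUTSTANDING];

    /* Allowed to do nothing: only start a prefetch if a slot is free. */
    void
    ReadAsyncBuffer(int fd, long blocknum)
    {
        for (int i = 0; i < MAX_OUTSTANDING; i++)
        {
            if (!slots[i].in_use)
            {
                memset(&slots[i].cb, 0, sizeof(slots[i].cb));
                slots[i].cb.aio_fildes = fd;
                slots[i].cb.aio_buf    = slots[i].buf;
                slots[i].cb.aio_nbytes = BLCKSZ;
                slots[i].cb.aio_offset = blocknum * (long) BLCKSZ;
                if (aio_read(&slots[i].cb) == 0)
                {
                    slots[i].in_use = 1;
                    slots[i].fd = fd;
                    slots[i].blocknum = blocknum;
                }
                return;
            }
        }
        /* table full: silently skip, exactly as described above */
    }

    /* Called from the normal read path: if the block was prefetched,
     * wait for that request instead of issuing a fresh read. */
    char *
    WaitForPrefetched(int fd, long blocknum)
    {
        for (int i = 0; i < MAX_OUTSTANDING; i++)
        {
            if (slots[i].in_use && slots[i].fd == fd &&
                slots[i].blocknum == blocknum)
            {
                const struct aiocb *list[1] = { &slots[i].cb };
                while (aio_error(&slots[i].cb) == EINPROGRESS)
                    aio_suspend(list, 1, NULL);
                slots[i].in_use = 0;
                return aio_return(&slots[i].cb) == BLCKSZ ? slots[i].buf : NULL;
            }
        }
        return NULL;            /* not prefetched; caller does a normal read */
    }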
On 10/15/06, Luke Lonergan <llonergan@greenplum.com> wrote:
> The killer use-case we've identified is the scattered I/O associated
> with index + heap scans in Postgres. If we can issue ~5-15 I/Os in advance
> when the TIDs are widely separated, it has the potential to increase the I/O
> speed by the number of disks in the tablespace being scanned. At this
> point, that pattern will only use one disk.

did you have a chance to look at posix_fadvise?

merlin
* Neil Conway:
> [1] http://lse.sourceforge.net/io/aio.html

Last-Modified: Mon, 07 Jun 2004 12:00:09 GMT

But you are right -- it seems that io_submit still blocks without O_DIRECT. *sigh*

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Durlacher Allee 47            tel: +49-721-96201-1
D-76131 Karlsruhe             fax: +49-721-96201-99
Have a look at this: [2] http://www-128.ibm.com/developerworks/linux/library/l-async/

This gives a good description of AIO. I'm doing some testing and will notify the list if I get any positive results. Please let me know if you get any ideas after reading [2].

Regards,
Raja

On 10/17/06, Florian Weimer <fweimer@bfk.de> wrote:
> But you are right -- it seems that io_submit still blocks without
> O_DIRECT. *sigh*
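[The article in [2] also covers the callback-style notification asked about at the start of the thread. A minimal sketch of that style, assuming a POSIX AIO implementation with SIGEV_THREAD support; the file name is a placeholder and the final sleep() stands in for real synchronization. Compile with -lrt.]

    /* POSIX AIO with a thread-callback on completion. */
    #include <aio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static char buf[8192];

    static void
    read_done(union sigval sv)
    {
        struct aiocb *cb = sv.sival_ptr;
        printf("callback: read %zd bytes\n", aio_return(cb));
    }

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        /* Ask for a callback in a new thread when the read completes. */
        cb.aio_sigevent.sigev_notify          = SIGEV_THREAD;
        cb.aio_sigevent.sigev_notify_function = read_done;
        cb.aio_sigevent.sigev_value.sival_ptr = &cb;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        sleep(1);               /* crude: give the callback time to fire */
        close(fd);
        return 0;
    }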
Hi,
"bgwriter doing aysncronous I/O for the dirty buffers that it is supposed to sync"
Another decent use-case?
Regards,
Nikhils
EnterpriseDB http://www.enterprisedb.com
--
All the world's a stage, and most of us are desperately unrehearsed.
"bgwriter doing aysncronous I/O for the dirty buffers that it is supposed to sync"
Another decent use-case?
Regards,
Nikhils
EnterpriseDB http://www.enterprisedb.com
On 10/15/06, Luke Lonergan <llonergan@greenplum.com> wrote:
Martijn,
On 10/15/06 10:56 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote:
> Have enough systems actually got to the point of actually supporting
> async I/O that it's worth implementing?
I think there are enough high end applications / systems that need it at
this point.
The killer use-case we've identified is for the scattered I/O associated
with index + heap scans in Postgres. If we can issue ~5-15 I/Os in advance
when the TIDs are widely separated it has the potential to increase the I/O
speed by the number of disks in the tablespace being scanned. At this
point, that pattern will only use one disk.
- Luke
---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
--
All the world's a stage, and most of us are desperately unrehearsed.
NikhilS wrote:
> Hi,
>
> "bgwriter doing asynchronous I/O for the dirty buffers that it is
> supposed to sync"
> Another decent use-case?
>
> On 10/15/06, Luke Lonergan <llonergan@greenplum.com> wrote:
>> Martijn,
>>
>> On 10/15/06 10:56 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote:
>>> Have enough systems actually got to the point of actually supporting
>>> async I/O that it's worth implementing?
>>
>> I think there are enough high-end applications / systems that need it at
>> this point.
>>
>> The killer use-case we've identified is the scattered I/O associated
>> with index + heap scans in Postgres. If we can issue ~5-15 I/Os in
>> advance when the TIDs are widely separated, it has the potential to
>> increase the I/O speed by the number of disks in the tablespace being
>> scanned. At this point, that pattern will only use one disk.

Is it worth considering using readv(2) instead?

Cheers

Mark
On Wed, Oct 18, 2006 at 08:04:29PM +1300, Mark Kirkwood wrote:
> > "bgwriter doing asynchronous I/O for the dirty buffers that it is
> > supposed to sync"
> > Another decent use-case?

Good idea, but async i/o is generally poorly supported.

> Is it worth considering using readv(2) instead?

Err, readv allows you to split a single consecutive read into multiple buffers. It doesn't help at all for reads on widely separated areas of a file.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
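[To illustrate the readv(2) point: the vector scatters one contiguous file range across several memory buffers; there is no way to name several distant file offsets in one call. A minimal sketch with a placeholder file:]

    /* readv() fills several buffers from ONE contiguous file range; it is
     * scatter in memory, not scatter on disk. */
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char a[8192], b[8192];
        struct iovec iov[2] = {
            { .iov_base = a, .iov_len = sizeof(a) },
            { .iov_base = b, .iov_len = sizeof(b) },
        };

        /* Reads 16kB starting at the current offset: bytes 0-8191 land
         * in a, 8192-16383 in b. Two distant blocks cannot be named. */
        ssize_t n = readv(fd, iov, 2);
        printf("read %zd contiguous bytes\n", n);

        close(fd);
        return 0;
    }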
On Sun, Oct 15, 2006 at 14:26:12 -0400, Neil Conway <neilc@samurai.com> wrote:
> At least according to [1], kernel AIO on Linux still doesn't work for
> buffered (i.e. non-O_DIRECT) files. There have been patches available
> for quite some time that implement this, but I'm not sure when they are
> likely to get into the mainline kernel.
>
> [1] http://lse.sourceforge.net/io/aio.html

An improvement is going into 2.6.19 to handle asynchronous vectored reads and writes. This was covered by Linux Weekly News a couple of weeks ago: http://lwn.net/Articles/201682/
Hi,

On 10/18/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Wed, Oct 18, 2006 at 08:04:29PM +1300, Mark Kirkwood wrote:
> > > "bgwriter doing asynchronous I/O for the dirty buffers that it is
> > > supposed to sync"
> > > Another decent use-case?
>
> Good idea, but async i/o is generally poorly supported.
>
> > Is it worth considering using readv(2) instead?
>
> Err, readv allows you to split a single consecutive read into multiple
> buffers. It doesn't help at all for reads on widely separated areas of
> a file.

Async I/O is stably supported on most *nix (apart from Linux 2.6.*) plus Windows. Guess it would still be worth it, since one fine day 2.6.* will start supporting it properly too.

Regards,
Nikhils
--
All the world's a stage, and most of us are desperately unrehearsed.
> > At least according to [1], kernel AIO on Linux still doesn't work for
> > buffered (i.e. non-O_DIRECT) files. There have been patches available
> > for quite some time that implement this, but I'm not sure when they
> > are likely to get into the mainline kernel.
> >
> > [1] http://lse.sourceforge.net/io/aio.html
>
> An improvement is going into 2.6.19 to handle asynchronous
> vectored reads and writes. This was covered by Linux Weekly
> News a couple of weeks ago:
> http://lwn.net/Articles/201682/

That is orthogonal. We don't really need vectored I/O so much, since we rely on OS readahead. We want async I/O to tell the OS earlier that we will need these random pages, and to continue our work in the meantime. For random I/O it is really important to tell the OS and disk subsystem about many pages in parallel, so it can optimize head movements and keep more than one disk busy at a time.

Andreas
Zeugswetter Andreas ADI SD wrote:
> > An improvement is going into 2.6.19 to handle asynchronous
> > vectored reads and writes. This was covered by Linux Weekly
> > News a couple of weeks ago:
> > http://lwn.net/Articles/201682/
>
> That is orthogonal. We don't really need vectored I/O so much, since we
> rely on OS readahead. We want async I/O to tell the OS earlier that we
> will need these random pages, and continue our work in the meantime.

Of course, you can use an asynchronous vectored write with a single entry in the vector if you want to perform an asynchronous write.

--
Alvaro Herrera                    http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Oct 20, 2006 at 11:13:33AM +0530, NikhilS wrote:
> > Good idea, but async i/o is generally poorly supported.
>
> Async I/O is stably supported on most *nix (apart from Linux 2.6.*) plus
> Windows. Guess it would still be worth it, since one fine day 2.6.* will
> start supporting it properly too.

Only if it can be shown that async I/O actually results in an improvement. Currently, it's speculation, with the one trial implementation showing little to no improvement. Support is a big word in the face of this initial evidence... :-)

It's possible that the PostgreSQL design limits the effectiveness of such things. It's possible that PostgreSQL, having been optimized to not use features such as these, has found a way of operating better, contrary to those who believe that async I/O, threads, and so on, are faster. It's possible that async I/O is supported, but poorly implemented on most systems.

Take into account that async I/O doesn't guarantee parallel I/O. The concept of async I/O is that an application can proceed to work on other items while waiting for scheduled work in the background. This can be achieved with a background system thread (GLIBC?). There is no requirement that the requests actually be processed in parallel. In fact, any system that did process the requests in parallel would be easier to run to a halt. For example, for the many systems that do not use RAID, we would potentially end up with scattered reads across the disk all running in parallel, with no priority on the reads, which could mean that data we do not yet need is returned first, causing PostgreSQL to be unable to move forwards. If the process is CPU bound at all, this could be an overall loss.

Point being, async I/O isn't a magic bullet. There is no evidence that it would improve the situation on any platform. One would need to consider the PostgreSQL architecture, determine where the bottleneck actually is, and understand why it is a bottleneck fully, before one could decide how to fix it. So, what is the bottleneck? Is PostgreSQL unable to max out the I/O bandwidth? Where? Why?

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com  |  Neighbourhood Coder
Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them
all and in the darkness bind them...
http://mark.mielke.cc/
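[A minimal sketch of the "background system thread" flavour of async I/O mentioned above, assuming plain POSIX threads; the file name is a placeholder. It shows why async I/O by itself only buys overlap with computation: actual I/O parallelism depends on how many such workers (and disks) exist.]

    /* Async-I/O-by-thread: a worker does a plain pread() while the
     * caller continues; link with -lpthread. */
    #include <pthread.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct
    {
        int     fd;
        off_t   offset;
        char    buf[8192];
        ssize_t result;
    } PrefetchReq;

    static void *
    worker(void *arg)
    {
        PrefetchReq *req = arg;
        req->result = pread(req->fd, req->buf, sizeof(req->buf), req->offset);
        return NULL;
    }

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        PrefetchReq req = { .fd = fd, .offset = 0 };
        pthread_t tid;
        pthread_create(&tid, NULL, worker, &req);   /* "async" read begins */

        /* ... caller continues with other work here ... */

        pthread_join(tid, NULL);                    /* wait for completion */
        printf("prefetched %zd bytes\n", req.result);
        close(fd);
        return 0;
    }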
> > > Good idea, but async i/o is generally poorly supported.
>
> Only if it can be shown that async I/O actually results in an
> improvement.

Sure.

> fix it. So, what is the bottleneck? Is PostgreSQL unable to
> max out the I/O bandwidth? Where? Why?

Yup, that would be the scenario where it helps (provided that you have a smart disk or a disk array and an intelligent OS aio implementation). It would be used to fetch the data pages pointed at from an index leaf, or the next-level index pages.

We measured the I/O bandwidth difference on Windows with EMC as being nearly proportional to the number of parallel outstanding requests, up to at least 16-32.

Andreas
On Fri, Oct 20, 2006 at 05:37:48PM +0200, Zeugswetter Andreas ADI SD wrote:
> Yup, that would be the scenario where it helps (provided that you have
> a smart disk or a disk array and an intelligent OS aio implementation).
> It would be used to fetch the data pages pointed at from an index leaf,
> or the next-level index pages.
> We measured the I/O bandwidth difference on Windows with EMC as being
> nearly proportional to parallel outstanding requests up to at least

Measured it using what? I was under the impression only one proof-of-implementation existed, and that the scenarios and configuration of the person who wrote it did not show significant improvement. You have PostgreSQL on Windows with EMC with async I/O support to test with?

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
http://mark.mielke.cc/
On Fri, Oct 20, 2006 at 10:05:01AM -0400, mark@mark.mielke.cc wrote:
> Only if it can be shown that async I/O actually results in an improvement.
>
> Currently, it's speculation, with the one trial implementation showing
> little to no improvement. Support is a big word in the face of this
> initial evidence... :-)

Yeah, the single test so far, on a system that didn't support asynchronous I/O, doesn't prove anything. It would help if there was a reasonable system that did support async i/o so it could be tested properly.

> Point being, async I/O isn't a magic bullet. There is no evidence that it
> would improve the situation on any platform.

I think it's likely to help with index scans. Prefetching index leaf pages I think could be good, as would prefetching pages from a (bitmap) index scan. It won't help much on very simple queries, but where it should shine is a merge join across two index scans. Currently postgresql would do something like:

    Loop
        Fetch left tuple for join
            Fetch btree leaf
            Fetch tuple off disk
        Fetch right tuples for join
            Fetch btree leaf
            Fetch tuple off disk

Currently it fetches a block from one file, then a block from the other, back and forth. With async i/o you could read from both files and the indexes simultaneously, thus in theory leading to better i/o throughput.

> One would need to consider the PostgreSQL architecture, determine where
> the bottleneck actually is, and understand why it is a bottleneck fully,
> before one could decide how to fix it. So, what is the bottleneck? Is
> PostgreSQL unable to max out the I/O bandwidth? Where? Why?

For systems where postgresql is unable to saturate the i/o bandwidth, this is the proposed solution. Are there others?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
> On Fri, Oct 20, 2006 at 10:05:01AM -0400, mark@mark.mielke.cc wrote:
>> One would need to consider the PostgreSQL architecture, determine where
>> the bottleneck actually is, and understand why it is a bottleneck fully,
>> before one could decide how to fix it. So, what is the bottleneck?

I think Mark's point is not being taken sufficiently to heart in this thread. It's not difficult at all to think of reasons why attempted read-ahead could be a net loss. One that's bothering me right at the moment is that each such request would require a visit to the shared buffer manager to see if we already have the desired page in buffers. (Unless you think it'd be cheaper to force the kernel to uselessly read the page...) Then another visit when we actually need the page. That means that readahead will double the contention for the buffer manager locks, which is likely to put us right back into the context swap storm problem that we've spent the last couple of releases working out of.

So far I've seen no evidence that async I/O would help us, only a lot of wishful thinking.

regards, tom lane
On 10/20/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So far I've seen no evidence that async I/O would help us, only a lot
> of wishful thinking.

is this thread moot? while researching this thread I came across this article: http://kerneltrap.org/node/6642 describing claims of a 30% performance boost when using posix_fadvise to ask the o/s to prefetch data. istm that this kind of improvement is in line with what aio can provide, and posix_fadvise is cleaner, not requiring threads and such.

merlin
On Fri, Oct 20, 2006 at 03:04:55PM -0400, Merlin Moncure wrote:
> is this thread moot? while researching this thread I came across this
> article: http://kerneltrap.org/node/6642 describing claims of a 30%
> performance boost when using posix_fadvise to ask the o/s to prefetch
> data. istm that this kind of improvement is in line with what aio can
> provide, and posix_fadvise is cleaner, not requiring threads and such.

Hmm, my man page says:

    POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE both initiate a
    non-blocking read of the specified region into the page cache.
    The amount of data read may be decreased by the kernel depending
    on VM load. (A few megabytes will usually be fully satisfied,
    and more is rarely useful.)

This appears to be exactly what we want, no? It would be nice to get some idea of what systems support this.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
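[For reference, hinting the scattered 8kB blocks from an index scan with POSIX_FADV_WILLNEED would look something like this sketch. The file name and block numbers are made up, and whether the kernel actually starts the reads is entirely up to it; the call itself returns immediately.]

    /* Hint a handful of scattered 8kB heap blocks before reading them. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* e.g. block numbers pulled from an index scan */
        long blocks[] = { 3, 911, 4807, 52, 23171 };

        /* Advise the kernel; it may prefetch the ranges into its page
         * cache, or ignore the advice entirely. */
        for (int i = 0; i < 5; i++)
            posix_fadvise(fd, blocks[i] * (off_t) BLCKSZ, BLCKSZ,
                          POSIX_FADV_WILLNEED);

        /* ... later, ordinary pread() calls hopefully hit the cache ... */
        close(fd);
        return 0;
    }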
On 10/21/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Hmm, my man page says:
>
>     POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE both initiate a
>     non-blocking read of the specified region into the page cache.
>
> This appears to be exactly what we want, no? It would be nice to get
> some idea of what systems support this.

right, and a small clarification: the above claim of 30% was from using adaptive readahead, not posix_fadvise. posix_fadvise was suggested by none other than andrew morton as the way to get the most i/o out of your box. there was no mention of aio :)

merlin
Martijn van Oosterhout wrote:
> Hmm, my man page says:
>
>     POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE both initiate a
>     non-blocking read of the specified region into the page cache.
>     The amount of data read may be decreased by the kernel depending
>     on VM load. (A few megabytes will usually be fully satisfied,
>     and more is rarely useful.)
>
> This appears to be exactly what we want, no? It would be nice to get
> some idea of what systems support this.

See our xlog.c for our experience in trying to use it:

    /*
     * posix_fadvise is problematic on many platforms: on older x86 Linux it
     * just dumps core, and there are reports of problems on PPC platforms as
     * well. The following is therefore disabled for the time being. We could
     * consider some kind of configure test to see if it's safe to use, but
     * since we lack hard evidence that there's any useful performance gain to
     * be had, spending time on that seems unprofitable for now.
     */

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
> Hmm, my man page says:
>
>     POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE both initiate a
>     non-blocking read of the specified region into the page cache.
>     The amount of data read may be decreased by the kernel depending
>     on VM load. (A few megabytes will usually be fully satisfied,
>     and more is rarely useful.)
>
> This appears to be exactly what we want, no? It would be nice
> to get some idea of what systems support this.

POSIX_FADV_WILLNEED definitely sounds very interesting, but: I think this interface was intended to hint larger areas (megabytes). The "wishful" thinking was not to hint seq scans, but to advise single 8k pages. The OS is responsible for sequential readahead, but it cannot anticipate the random access that results from btree access (unless of course we are talking about very small tables). So I doubt that with this interface many OS's will actually forward multiple I/Os to the disk subsystem in parallel, which is what would be needed. Also, the comment Bruce quoted does not sound encouraging :-(

Andreas
> > So far I've seen no evidence that async I/O would help us, only a lot
> > of wishful thinking.
>
> is this thread moot? while researching this thread I came across this
> article: http://kerneltrap.org/node/6642 describing claims of a 30%
> performance boost when using posix_fadvise to ask the o/s to prefetch
> data. istm that this kind of improvement is in line with what aio can
> provide, and posix_fadvise is cleaner, not requiring threads and such.

This again is about better OS readahead for sequential access, where standard Linux obviously behaves differently. It is not about random access.

Btw, I do understand the opinion of the Linux developers that pg should actually read larger blocks for seq scans. Under high disk load, OS's tend not to do all the needed readahead, which has pros and cons, but mainly cons for pg.

Andreas
> > Yup, that would be the scenario where it helps (provided that you have
> > a smart disk or a disk array and an intelligent OS aio implementation).
> > It would be used to fetch the data pages pointed at from an index
> > leaf, or the next-level index pages.
> > We measured the I/O bandwidth difference on Windows with EMC as being
> > nearly proportional to parallel outstanding requests up to at least
>
> Measured it using what? I was under the impression only one
> proof-of-implementation existed, and that the scenarios and
> configuration of the person who wrote it did not show
> significant improvement.

IIRC the configuration of that test was not suitable to show any benefit. Minimum requirements to show improvement are:
- very few active sessions (typically fewer than the number of disks)
- a table that spans multiple disks (typically on a stripe set), or one intelligent scsi disk
- only random disk access plans

> You have PostgreSQL on Windows with EMC with async I/O
> support to test with?

No, sorry. Was a MaxDB issue.

Andreas
Zeugswetter Andreas ADI SD wrote:
> POSIX_FADV_WILLNEED definitely sounds very interesting, but:
>
> I think this interface was intended to hint larger areas (megabytes).
> But the "wishful" thinking was not to hint seq scans, but to advise
> single 8k pages.

Surely POSIX_FADV_SEQUENTIAL is the one intended to hint seq scans, and POSIX_FADV_RANDOM to hint random access, no? ISTM _WILLNEED seems just right for small random-access blocks.

Anyway, for those who want to see what they do in Linux:
http://www.gelato.unsw.edu.au/lxr/source/mm/fadvise.c

Pretty scary that Bruce said it could make older linuxes dump core - there isn't a lot of code there.
On Tue, Oct 24, 2006 at 12:53:23PM -0700, Ron Mayer wrote:
> Anyway, for those who want to see what they do in Linux:
> http://www.gelato.unsw.edu.au/lxr/source/mm/fadvise.c
> Pretty scary that Bruce said it could make older linuxes
> dump core - there isn't a lot of code there.

The bug was probably in the glibc interface to the kernel. Google found this:

http://sourceware.org/ml/libc-hacker/2004-03/msg00000.html

i.e. posix_fadvise appears to have been broken on all 64-bit architectures prior to March 2004 due to a silly linking error.

And then things like this:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=313219

which suggests that prior to glibc 2.3.5, posix_fadvise crashed on 2.4 kernels. That's a fairly recent version, so the bug would still be fairly widespread.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Hi,
While we are at async I/O, I think direct I/O and concurrent I/O also deserve a look. The archives suggest that Bruce had some misgivings about dio because of the loss of kernel caching, but almost all databases seem to (carefully) use dio (Solaris, Linux, ?) and cio (AIX) extensively nowadays.
Since these can be enabled on a per-file basis, perf testing them should be simpler too.
Regards,
Nikhils
--
All the world's a stage, and most of us are desperately unrehearsed.
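[A minimal sketch of the per-file direct I/O idea above, assuming Linux, where it is an open() flag (O_DIRECT, a GNU extension); Solaris instead uses directio(fd, DIRECTIO_ON), and AIX cio is a mount/open option. The file name is a placeholder, and buffers and offsets must be suitably aligned.]

    /* Per-file direct I/O on Linux: the read bypasses the kernel page
     * cache entirely. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 8192) != 0)   /* alignment required */
            return 1;

        ssize_t n = pread(fd, buf, 8192, 0);         /* uncached read */
        printf("direct read: %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }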