Thread: Asynchronous I/O Support
Postgres 8.1 doesn't seem to support asynchronous I/O. Has its design been thought of already?

We tried a simple example. For an index nested-loop join: fetch the outer tuples into an array, then send all the corresponding inner-tuple fetch requests asynchronously. While the I/O for the inner relation is in progress, the next outer-tuple array can be populated and other join work can happen. This is the maximum overlap we could think of while making minimal changes. [The current implementation does synchronous I/O: it fetches an outer tuple, requests the corresponding inner tuple (and waits until it arrives), does the processing, gets another inner/outer tuple, and so on.]

We have made appropriate changes in nodeNestloop.c but are unable to track down how it issues the I/O and gets the tuple into the slot. Help!

How does one issue an async I/O (given that kernel 2.6 supports AIO)? And which would be best: a callback scheme, or synchronous I/O on top of AIO? Also, as Graefe's paper suggests, a producer-consumer (thread-based) design is the best way to do this. But how would we implement threading (in case it's possible at all)?

Sincere regards,
Raja Agrawal
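[For concreteness: the batch-and-overlap pattern described above looks roughly like the following in portable POSIX AIO. This is a sketch, not PostgreSQL code; the file name, request count, and offsets are placeholders, and on Linux glibc may emulate these calls with background threads rather than kernel AIO. Compile with -lrt.]

    /* Issue a batch of reads with POSIX AIO, do other work, then reap them. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NREQS  8
    #define BLCKSZ 8192

    int main(void)
    {
        int fd = open("/tmp/inner_relation", O_RDONLY);   /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        static char bufs[NREQS][BLCKSZ];
        struct aiocb cbs[NREQS];
        const struct aiocb *list[NREQS];

        /* Issue all inner-tuple block reads up front. */
        for (int i = 0; i < NREQS; i++)
        {
            memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf    = bufs[i];
            cbs[i].aio_nbytes = BLCKSZ;
            cbs[i].aio_offset = (off_t) i * BLCKSZ;  /* scattered in real use */
            if (aio_read(&cbs[i]) != 0) { perror("aio_read"); return 1; }
            list[i] = &cbs[i];
        }

        /* ... populate the next outer-tuple array here, overlapping the I/O ... */

        /* Reap completions; NULL entries in the list are ignored. */
        int pending = NREQS;
        while (pending > 0)
        {
            if (aio_suspend(list, NREQS, NULL) != 0 && errno != EINTR)
            { perror("aio_suspend"); return 1; }
            for (int i = 0; i < NREQS; i++)
            {
                if (list[i] && aio_error(&cbs[i]) != EINPROGRESS)
                {
                    printf("request %d read %zd bytes\n", i, aio_return(&cbs[i]));
                    list[i] = NULL;
                    pending--;
                }
            }
        }
        close(fd);
        return 0;
    }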
On Sun, Oct 15, 2006 at 04:16:07AM +0530, Raja Agrawal wrote:
> Postgres 8.1 doesn't seem to support asynchronous I/O. Has its design
> been thought of already?

Sure, I even implemented it once. Didn't get any faster. At that point I realised that my kernel didn't actually support async I/O, and the glibc emulation sucks for anything other than network I/O, so I gave up. Maybe one of these days I should work out if my current system supports it, and give it another go...

Have enough systems actually got to the point of actually supporting async I/O that it's worth implementing?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn,

On 10/15/06 10:56 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote:
> Have enough systems actually got to the point of actually supporting
> async I/O that it's worth implementing?

I think there are enough high-end applications / systems that need it at this point.

The killer use-case we've identified is the scattered I/O associated with index + heap scans in Postgres. If we can issue ~5-15 I/Os in advance when the TIDs are widely separated, it has the potential to increase the I/O speed by the number of disks in the tablespace being scanned. At this point, that pattern will only use one disk.

- Luke
On Sun, 2006-10-15 at 19:56 +0200, Martijn van Oosterhout wrote:
> Sure, I even implemented it once. Didn't get any faster.

Did you just do something akin to s/read/aio_read/ etc., or something more ambitious? I think that really taking advantage of the ability to have multiple I/O requests outstanding would take some leg work.

> Maybe one of these days I should work out if my current system supports
> it, and give it another go...

At least according to [1], kernel AIO on Linux still doesn't work for buffered (i.e. non-O_DIRECT) files. There have been patches available for quite some time that implement this, but I'm not sure when they are likely to get into the mainline kernel.

-Neil

[1] http://lse.sourceforge.net/io/aio.html
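[To make the buffered-vs-O_DIRECT distinction concrete: the kernel interface in question is the io_submit() family, usually reached through libaio. A sketch, assuming libaio is installed (link with -laio); without O_DIRECT the submission will typically just execute synchronously. File name and sizes are placeholders.]

    /* Linux kernel AIO via libaio; only truly asynchronous for O_DIRECT
     * files, which in turn require aligned buffers and offsets. */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, BLCKSZ) != 0)  /* O_DIRECT alignment */
            return 1;

        io_context_t ctx = 0;
        if (io_queue_init(32, &ctx) != 0) { perror("io_queue_init"); return 1; }

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, BLCKSZ, 0);       /* block at offset 0 */

        /* Returns immediately for O_DIRECT; may block and do the read
         * synchronously for ordinary buffered files. */
        if (io_submit(ctx, 1, cbs) != 1)
        { fprintf(stderr, "io_submit failed\n"); return 1; }

        /* ... do other work here ... */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);           /* reap the completion */
        printf("read %ld bytes\n", (long) ev.res);

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }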
On Sun, Oct 15, 2006 at 02:26:12PM -0400, Neil Conway wrote:
> Did you just do something akin to s/read/aio_read/ etc., or something
> more ambitious? I think that really taking advantage of the ability to
> have multiple I/O requests outstanding would take some leg work.

Sure. Basically, at certain strategic points in the code there were extra ReadAsyncBuffer() commands (the IndexScan node and the b-tree scan code). This command was allowed to do nothing, but if there were not too many outstanding requests and a buffer was available, it would allocate a buffer and initiate an AIO request for it. IIRC there was a table of outstanding requests (I think I originally allowed up to 32) and when a normal ReadBuffer() found the block had already been requested, it "waited" on that block.

The principle was that the index-scan node would read a page full of tids, submit a ReadAsyncBuffer() on each one, and then proceed as normal. Fairly unintrusive patch all up. ifdeffing it out is safe, and #defining ReadAsyncBuffer() away causes the compiler to optimise the loop away altogether.

The POSIX AIO layer sucks somewhat so it was tricky, but it did work. The hardest part is really how to decide if a buffer currently in the buffer cache is worth more than an asynchronously loaded buffer that may not be used. I posted the results to -hackers some time ago, so you can always try that.

> At least according to [1], kernel AIO on Linux still doesn't work for
> buffered (i.e. non-O_DIRECT) files. There have been patches available
> for quite some time that implement this, but I'm not sure when they are
> likely to get into the mainline kernel.

You can also do it by spawning off threads to do the requests. The glibc emulation uses threads, but only allows one outstanding request per file, which makes it useless for our purposes...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
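[For readers following along, a hypothetical, much-simplified sketch of the outstanding-request table described above. The names (ReadAsyncBuffer, WaitForPrefetched) and the layout are inventions for illustration, not the actual patch, and all real buffer-manager locking is omitted.]

    /* Hypothetical sketch of an outstanding-prefetch table, loosely
     * following the patch description above; not real PostgreSQL code. */
    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    #define MAX_OUTSTANDING 32
    #define BLCKSZ 8192

    typedef struct
    {
        int          in_use;
        int          fd;            /* relation file */
        long         blocknum;      /* block being prefetched */
        char         buf[BLCKSZ];
        struct aiocb cb;
    } AsyncSlot;

    static AsyncSlot slots[MAX_OUTSTANDING];

    /* Allowed to do nothing: only start a prefetch if a slot is free. */
    void
    ReadAsyncBuffer(int fd, long blocknum)
    {
        for (int i = 0; i < MAX_OUTSTANDING; i++)
        {
            if (!slots[i].in_use)
            {
                memset(&slots[i].cb, 0, sizeof(slots[i].cb));
                slots[i].cb.aio_fildes = fd;
                slots[i].cb.aio_buf    = slots[i].buf;
                slots[i].cb.aio_nbytes = BLCKSZ;
                slots[i].cb.aio_offset = blocknum * (long) BLCKSZ;
                if (aio_read(&slots[i].cb) == 0)
                {
                    slots[i].in_use = 1;
                    slots[i].fd = fd;
                    slots[i].blocknum = blocknum;
                }
                return;
            }
        }
        /* table full: silently skip, exactly as described above */
    }

    /* Called from the normal read path: if the block was prefetched,
     * wait for that request instead of issuing a fresh read. */
    char *
    WaitForPrefetched(int fd, long blocknum)
    {
        for (int i = 0; i < MAX_OUTSTANDING; i++)
        {
            if (slots[i].in_use && slots[i].fd == fd &&
                slots[i].blocknum == blocknum)
            {
                const struct aiocb *list[1] = { &slots[i].cb };
                while (aio_error(&slots[i].cb) == EINPROGRESS)
                    aio_suspend(list, 1, NULL);
                slots[i].in_use = 0;
                return aio_return(&slots[i].cb) == BLCKSZ ? slots[i].buf : NULL;
            }
        }
        return NULL;            /* not prefetched; caller does a normal read */
    }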
On 10/15/06, Luke Lonergan <llonergan@greenplum.com> wrote:
> The killer use-case we've identified is the scattered I/O associated
> with index + heap scans in Postgres. If we can issue ~5-15 I/Os in advance
> when the TIDs are widely separated, it has the potential to increase the I/O
> speed by the number of disks in the tablespace being scanned. At this
> point, that pattern will only use one disk.

did you have a chance to look at posix_fadvise?

merlin
* Neil Conway:
> [1] http://lse.sourceforge.net/io/aio.html

Last-Modified: Mon, 07 Jun 2004 12:00:09 GMT

But you are right -- it seems that io_submit still blocks without O_DIRECT. *sigh*

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Durlacher Allee 47            tel: +49-721-96201-1
D-76131 Karlsruhe             fax: +49-721-96201-99
Have a look at this: [2] http://www-128.ibm.com/developerworks/linux/library/l-async/

This gives a good description of AIO. I'm doing some testing and will notify the list if I get any positive results. Please let me know if you get any ideas after reading [2].

Regards,
Raja

On 10/17/06, Florian Weimer <fweimer@bfk.de> wrote:
> But you are right -- it seems that io_submit still blocks without
> O_DIRECT. *sigh*
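[The article in [2] also covers the callback-style notification asked about at the start of the thread. A minimal sketch of that style, assuming a POSIX AIO implementation with SIGEV_THREAD support; the file name is a placeholder and the final sleep() stands in for real synchronization. Compile with -lrt.]

    /* POSIX AIO with a thread-callback on completion. */
    #include <aio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static char buf[8192];

    static void
    read_done(union sigval sv)
    {
        struct aiocb *cb = sv.sival_ptr;
        printf("callback: read %zd bytes\n", aio_return(cb));
    }

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        /* Ask for a callback in a new thread when the read completes. */
        cb.aio_sigevent.sigev_notify          = SIGEV_THREAD;
        cb.aio_sigevent.sigev_notify_function = read_done;
        cb.aio_sigevent.sigev_value.sival_ptr = &cb;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        sleep(1);               /* crude: give the callback time to fire */
        close(fd);
        return 0;
    }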
Hi,
"bgwriter doing aysncronous I/O for the dirty buffers that it is supposed to sync"
Another decent use-case?
Regards,
Nikhils
EnterpriseDB http://www.enterprisedb.com
--
All the world's a stage, and most of us are desperately unrehearsed.
"bgwriter doing aysncronous I/O for the dirty buffers that it is supposed to sync"
Another decent use-case?
Regards,
Nikhils
EnterpriseDB http://www.enterprisedb.com
On 10/15/06, Luke Lonergan <llonergan@greenplum.com> wrote:
Martijn,
On 10/15/06 10:56 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote:
> Have enough systems actually got to the point of actually supporting
> async I/O that it's worth implementing?
I think there are enough high end applications / systems that need it at
this point.
The killer use-case we've identified is for the scattered I/O associated
with index + heap scans in Postgres. If we can issue ~5-15 I/Os in advance
when the TIDs are widely separated it has the potential to increase the I/O
speed by the number of disks in the tablespace being scanned. At this
point, that pattern will only use one disk.
- Luke
---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
--
All the world's a stage, and most of us are desperately unrehearsed.
NikhilS wrote:
> Hi,
>
> "bgwriter doing asynchronous I/O for the dirty buffers that it is
> supposed to sync"
> Another decent use-case?
>
> On 10/15/06, Luke Lonergan <llonergan@greenplum.com> wrote:
>> Martijn,
>>
>> On 10/15/06 10:56 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote:
>>> Have enough systems actually got to the point of actually supporting
>>> async I/O that it's worth implementing?
>>
>> I think there are enough high-end applications / systems that need it at
>> this point.
>>
>> The killer use-case we've identified is the scattered I/O associated
>> with index + heap scans in Postgres. If we can issue ~5-15 I/Os in
>> advance when the TIDs are widely separated, it has the potential to
>> increase the I/O speed by the number of disks in the tablespace being
>> scanned. At this point, that pattern will only use one disk.

Is it worth considering using readv(2) instead?

Cheers

Mark
On Wed, Oct 18, 2006 at 08:04:29PM +1300, Mark Kirkwood wrote:
> > "bgwriter doing asynchronous I/O for the dirty buffers that it is
> > supposed to sync"
> > Another decent use-case?

Good idea, but async i/o is generally poorly supported.

> Is it worth considering using readv(2) instead?

Err, readv allows you to split a single consecutive read into multiple buffers. It doesn't help at all for reads on widely separated areas of a file.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
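[To illustrate the readv(2) point: the vector scatters one contiguous file range across several memory buffers; there is no way to name several distant file offsets in one call. A minimal sketch with a placeholder file:]

    /* readv() fills several buffers from ONE contiguous file range; it is
     * scatter in memory, not scatter on disk. */
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char a[8192], b[8192];
        struct iovec iov[2] = {
            { .iov_base = a, .iov_len = sizeof(a) },
            { .iov_base = b, .iov_len = sizeof(b) },
        };

        /* Reads 16kB starting at the current offset: bytes 0-8191 land
         * in a, 8192-16383 in b. Two distant blocks cannot be named. */
        ssize_t n = readv(fd, iov, 2);
        printf("read %zd contiguous bytes\n", n);

        close(fd);
        return 0;
    }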
On Sun, Oct 15, 2006 at 14:26:12 -0400, Neil Conway <neilc@samurai.com> wrote:
> At least according to [1], kernel AIO on Linux still doesn't work for
> buffered (i.e. non-O_DIRECT) files. There have been patches available
> for quite some time that implement this, but I'm not sure when they are
> likely to get into the mainline kernel.
>
> [1] http://lse.sourceforge.net/io/aio.html

An improvement is going into 2.6.19 to handle asynchronous vectored reads and writes. This was covered by Linux Weekly News a couple of weeks ago: http://lwn.net/Articles/201682/
Hi,

On 10/18/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Wed, Oct 18, 2006 at 08:04:29PM +1300, Mark Kirkwood wrote:
> > > "bgwriter doing asynchronous I/O for the dirty buffers that it is
> > > supposed to sync"
> > > Another decent use-case?
>
> Good idea, but async i/o is generally poorly supported.
>
> > Is it worth considering using readv(2) instead?
>
> Err, readv allows you to split a single consecutive read into multiple
> buffers. It doesn't help at all for reads on widely separated areas of
> a file.

Async I/O is stably supported on most *nix (apart from Linux 2.6.*) plus Windows. Guess it would still be worth it, since one fine day 2.6.* will start supporting it properly too.

Regards,
Nikhils
--
All the world's a stage, and most of us are desperately unrehearsed.
> > At least according to [1], kernel AIO on Linux still doesn't work for
> > buffered (i.e. non-O_DIRECT) files. There have been patches available
> > for quite some time that implement this, but I'm not sure when they
> > are likely to get into the mainline kernel.
> >
> > [1] http://lse.sourceforge.net/io/aio.html
>
> An improvement is going into 2.6.19 to handle asynchronous
> vectored reads and writes. This was covered by Linux Weekly
> News a couple of weeks ago:
> http://lwn.net/Articles/201682/

That is orthogonal. We don't really need vectored I/O so much, since we rely on OS readahead. We want async I/O to tell the OS earlier that we will need these random pages, and to continue our work in the meantime. For random I/O it is really important to tell the OS and disk subsystem about many pages in parallel, so it can optimize head movements and keep more than one disk busy at a time.

Andreas
Zeugswetter Andreas ADI SD wrote:
> > An improvement is going into 2.6.19 to handle asynchronous
> > vectored reads and writes. This was covered by Linux Weekly
> > News a couple of weeks ago:
> > http://lwn.net/Articles/201682/
>
> That is orthogonal. We don't really need vectored I/O so much, since we
> rely on OS readahead. We want async I/O to tell the OS earlier that we
> will need these random pages, and continue our work in the meantime.

Of course, you can use an asynchronous vectored write with a single entry in the vector if you want to perform an asynchronous write.

--
Alvaro Herrera                    http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Oct 20, 2006 at 11:13:33AM +0530, NikhilS wrote:
> > Good idea, but async i/o is generally poorly supported.
>
> Async I/O is stably supported on most *nix (apart from Linux 2.6.*) plus
> Windows. Guess it would still be worth it, since one fine day 2.6.* will
> start supporting it properly too.

Only if it can be shown that async I/O actually results in an improvement. Currently, it's speculation, with the one trial implementation showing little to no improvement. Support is a big word in the face of this initial evidence... :-)

It's possible that the PostgreSQL design limits the effectiveness of such things. It's possible that PostgreSQL, having been optimized to not use features such as these, has found a way of operating better, contrary to those who believe that async I/O, threads, and so on, are faster. It's possible that async I/O is supported, but poorly implemented on most systems.

Take into account that async I/O doesn't guarantee parallel I/O. The concept of async I/O is that an application can proceed to work on other items while waiting for scheduled work in the background. This can be achieved with a background system thread (GLIBC?). There is no requirement that the requests actually be processed in parallel. In fact, any system that did process the requests in parallel would be easier to run to a halt. For example, for the many systems that do not use RAID, we would potentially end up with scattered reads across the disk all running in parallel, with no priority on the reads, which could mean that data we do not yet need is returned first, causing PostgreSQL to be unable to move forwards. If the process is CPU bound at all, this could be an overall loss.

Point being, async I/O isn't a magic bullet. There is no evidence that it would improve the situation on any platform. One would need to consider the PostgreSQL architecture, determine where the bottleneck actually is, and understand why it is a bottleneck fully, before one could decide how to fix it. So, what is the bottleneck? Is PostgreSQL unable to max out the I/O bandwidth? Where? Why?

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com  |  Neighbourhood Coder
Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them
all and in the darkness bind them...
http://mark.mielke.cc/
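[A minimal sketch of the "background system thread" flavour of async I/O mentioned above, assuming plain POSIX threads; the file name is a placeholder. It shows why async I/O by itself only buys overlap with computation: actual I/O parallelism depends on how many such workers (and disks) exist.]

    /* Async-I/O-by-thread: a worker does a plain pread() while the
     * caller continues; link with -lpthread. */
    #include <pthread.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct
    {
        int     fd;
        off_t   offset;
        char    buf[8192];
        ssize_t result;
    } PrefetchReq;

    static void *
    worker(void *arg)
    {
        PrefetchReq *req = arg;
        req->result = pread(req->fd, req->buf, sizeof(req->buf), req->offset);
        return NULL;
    }

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        PrefetchReq req = { .fd = fd, .offset = 0 };
        pthread_t tid;
        pthread_create(&tid, NULL, worker, &req);   /* "async" read begins */

        /* ... caller continues with other work here ... */

        pthread_join(tid, NULL);                    /* wait for completion */
        printf("prefetched %zd bytes\n", req.result);
        close(fd);
        return 0;
    }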
> > > Good idea, but async i/o is generally poorly supported.
>
> Only if it can be shown that async I/O actually results in an
> improvement.

Sure.

> fix it. So, what is the bottleneck? Is PostgreSQL unable to
> max out the I/O bandwidth? Where? Why?

Yup, that would be the scenario where it helps (provided that you have a smart disk or a disk array and an intelligent OS aio implementation). It would be used to fetch the data pages pointed at from an index leaf, or the next-level index pages.

We measured the I/O bandwidth difference on Windows with EMC as being nearly proportional to the number of parallel outstanding requests, up to at least 16-32.

Andreas
On Fri, Oct 20, 2006 at 05:37:48PM +0200, Zeugswetter Andreas ADI SD wrote:
> Yup, that would be the scenario where it helps (provided that you have
> a smart disk or a disk array and an intelligent OS aio implementation).
> It would be used to fetch the data pages pointed at from an index leaf,
> or the next-level index pages.
> We measured the I/O bandwidth difference on Windows with EMC as being
> nearly proportional to parallel outstanding requests up to at least

Measured it using what? I was under the impression only one proof-of-implementation existed, and that the scenarios and configuration of the person who wrote it did not show significant improvement. You have PostgreSQL on Windows with EMC with async I/O support to test with?

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
http://mark.mielke.cc/
On Fri, Oct 20, 2006 at 10:05:01AM -0400, mark@mark.mielke.cc wrote:
> Only if it can be shown that async I/O actually results in an improvement.
>
> Currently, it's speculation, with the one trial implementation showing
> little to no improvement. Support is a big word in the face of this
> initial evidence... :-)

Yeah, the single test so far, on a system that didn't support asynchronous I/O, doesn't prove anything. It would help if there was a reasonable system that did support async i/o so it could be tested properly.

> Point being, async I/O isn't a magic bullet. There is no evidence that it
> would improve the situation on any platform.

I think it's likely to help with index scans. Prefetching index leaf pages I think could be good, as would prefetching pages from a (bitmap) index scan. It won't help much on very simple queries, but where it should shine is a merge join across two index scans. Currently postgresql would do something like:

    Loop
        Fetch left tuple for join
            Fetch btree leaf
            Fetch tuple off disk
        Fetch right tuples for join
            Fetch btree leaf
            Fetch tuple off disk

Currently it fetches a block from one file, then a block from the other, back and forth. With async i/o you could read from both files and the indexes simultaneously, thus in theory leading to better i/o throughput.

> One would need to consider the PostgreSQL architecture, determine where
> the bottleneck actually is, and understand why it is a bottleneck fully,
> before one could decide how to fix it. So, what is the bottleneck? Is
> PostgreSQL unable to max out the I/O bandwidth? Where? Why?

For systems where postgresql is unable to saturate the i/o bandwidth, this is the proposed solution. Are there others?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
> On Fri, Oct 20, 2006 at 10:05:01AM -0400, mark@mark.mielke.cc wrote:
>> One would need to consider the PostgreSQL architecture, determine where
>> the bottleneck actually is, and understand why it is a bottleneck fully,
>> before one could decide how to fix it. So, what is the bottleneck?

I think Mark's point is not being taken sufficiently to heart in this thread. It's not difficult at all to think of reasons why attempted read-ahead could be a net loss. One that's bothering me right at the moment is that each such request would require a visit to the shared buffer manager to see if we already have the desired page in buffers. (Unless you think it'd be cheaper to force the kernel to uselessly read the page...) Then another visit when we actually need the page. That means that readahead will double the contention for the buffer manager locks, which is likely to put us right back into the context swap storm problem that we've spent the last couple of releases working out of.

So far I've seen no evidence that async I/O would help us, only a lot of wishful thinking.

regards, tom lane
On 10/20/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So far I've seen no evidence that async I/O would help us, only a lot
> of wishful thinking.

is this thread moot? while researching this thread I came across this article: http://kerneltrap.org/node/6642 describing claims of a 30% performance boost when using posix_fadvise to ask the o/s to prefetch data. istm that this kind of improvement is in line with what aio can provide, and posix_fadvise is cleaner, not requiring threads and such.

merlin
On Fri, Oct 20, 2006 at 03:04:55PM -0400, Merlin Moncure wrote:
> is this thread moot? while researching this thread I came across this
> article: http://kerneltrap.org/node/6642 describing claims of a 30%
> performance boost when using posix_fadvise to ask the o/s to prefetch
> data. istm that this kind of improvement is in line with what aio can
> provide, and posix_fadvise is cleaner, not requiring threads and such.

Hmm, my man page says:

    POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE both initiate a
    non-blocking read of the specified region into the page cache.
    The amount of data read may be decreased by the kernel depending
    on VM load. (A few megabytes will usually be fully satisfied,
    and more is rarely useful.)

This appears to be exactly what we want, no? It would be nice to get some idea of what systems support this.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
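[For reference, hinting the scattered 8kB blocks from an index scan with POSIX_FADV_WILLNEED would look something like this sketch. The file name and block numbers are made up, and whether the kernel actually starts the reads is entirely up to it; the call itself returns immediately.]

    /* Hint a handful of scattered 8kB heap blocks before reading them. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* e.g. block numbers pulled from an index scan */
        long blocks[] = { 3, 911, 4807, 52, 23171 };

        /* Advise the kernel; it may prefetch the ranges into its page
         * cache, or ignore the advice entirely. */
        for (int i = 0; i < 5; i++)
            posix_fadvise(fd, blocks[i] * (off_t) BLCKSZ, BLCKSZ,
                          POSIX_FADV_WILLNEED);

        /* ... later, ordinary pread() calls hopefully hit the cache ... */
        close(fd);
        return 0;
    }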
On 10/21/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Hmm, my man page says:
>
>     POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE both initiate a
>     non-blocking read of the specified region into the page cache.
>
> This appears to be exactly what we want, no? It would be nice to get
> some idea of what systems support this.

right, and a small clarification: the above claim of 30% was from using adaptive readahead, not posix_fadvise. posix_fadvise was suggested by none other than andrew morton as the way to get the most i/o out of your box. there was no mention of aio :)

merlin
Martijn van Oosterhout wrote:
> Hmm, my man page says:
>
>     POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE both initiate a
>     non-blocking read of the specified region into the page cache.
>     The amount of data read may be decreased by the kernel depending
>     on VM load. (A few megabytes will usually be fully satisfied,
>     and more is rarely useful.)
>
> This appears to be exactly what we want, no? It would be nice to get
> some idea of what systems support this.

See our xlog.c for our experience in trying to use it:

    /*
     * posix_fadvise is problematic on many platforms: on older x86 Linux it
     * just dumps core, and there are reports of problems on PPC platforms as
     * well. The following is therefore disabled for the time being. We could
     * consider some kind of configure test to see if it's safe to use, but
     * since we lack hard evidence that there's any useful performance gain to
     * be had, spending time on that seems unprofitable for now.
     */

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
> Hmm, my man page says:
>
>     POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE both initiate a
>     non-blocking read of the specified region into the page cache.
>     The amount of data read may be decreased by the kernel depending
>     on VM load. (A few megabytes will usually be fully satisfied,
>     and more is rarely useful.)
>
> This appears to be exactly what we want, no? It would be nice
> to get some idea of what systems support this.

POSIX_FADV_WILLNEED definitely sounds very interesting, but: I think this interface was intended to hint larger areas (megabytes). The "wishful" thinking was not to hint seq scans, but to advise single 8k pages. The OS is responsible for sequential readahead, but it cannot anticipate the random access that results from btree access (unless of course we are talking about very small tables). So I doubt that with this interface many OS's will actually forward multiple I/Os to the disk subsystem in parallel, which is what would be needed. Also, the comment Bruce quoted does not sound encouraging :-(

Andreas
> > So far I've seen no evidence that async I/O would help us, only a lot
> > of wishful thinking.
>
> is this thread moot? while researching this thread I came across this
> article: http://kerneltrap.org/node/6642 describing claims of a 30%
> performance boost when using posix_fadvise to ask the o/s to prefetch
> data. istm that this kind of improvement is in line with what aio can
> provide, and posix_fadvise is cleaner, not requiring threads and such.

This again is about better OS readahead for sequential access, where standard Linux obviously behaves differently. It is not about random access.

Btw, I do understand the opinion of the Linux developers that pg should actually read larger blocks for seq scans. Under high disk load, OS's tend not to do all the needed readahead, which has pros and cons, but mainly cons for pg.

Andreas
> > Yup, that would be the scenario where it helps (provided that you have
> > a smart disk or a disk array and an intelligent OS aio implementation).
> > It would be used to fetch the data pages pointed at from an index
> > leaf, or the next-level index pages.
> > We measured the I/O bandwidth difference on Windows with EMC as being
> > nearly proportional to parallel outstanding requests up to at least
>
> Measured it using what? I was under the impression only one
> proof-of-implementation existed, and that the scenarios and
> configuration of the person who wrote it did not show
> significant improvement.

IIRC the configuration of that test was not suitable to show any benefit. Minimum requirements to show improvement are:
- very few active sessions (typically fewer than the number of disks)
- a table that spans multiple disks (typically on a stripe set), or one intelligent scsi disk
- only random disk access plans

> You have PostgreSQL on Windows with EMC with async I/O
> support to test with?

No, sorry. Was a MaxDB issue.

Andreas
Zeugswetter Andreas ADI SD wrote:
> POSIX_FADV_WILLNEED definitely sounds very interesting, but:
>
> I think this interface was intended to hint larger areas (megabytes).
> But the "wishful" thinking was not to hint seq scans, but to advise
> single 8k pages.

Surely POSIX_FADV_SEQUENTIAL is the one intended to hint seq scans, and POSIX_FADV_RANDOM to hint random access, no? ISTM _WILLNEED seems just right for small random-access blocks.

Anyway, for those who want to see what they do in Linux:
http://www.gelato.unsw.edu.au/lxr/source/mm/fadvise.c

Pretty scary that Bruce said it could make older linuxes dump core - there isn't a lot of code there.
On Tue, Oct 24, 2006 at 12:53:23PM -0700, Ron Mayer wrote:
> Anyway, for those who want to see what they do in Linux:
> http://www.gelato.unsw.edu.au/lxr/source/mm/fadvise.c
> Pretty scary that Bruce said it could make older linuxes
> dump core - there isn't a lot of code there.

The bug was probably in the glibc interface to the kernel. Google found this:

http://sourceware.org/ml/libc-hacker/2004-03/msg00000.html

i.e. posix_fadvise appears to have been broken on all 64-bit architectures prior to March 2004 due to a silly linking error.

And then things like this:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=313219

which suggests that prior to glibc 2.3.5, posix_fadvise crashed on 2.4 kernels. That's a fairly recent version, so the bug would still be fairly widespread.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Hi,
While we are at async I/O, I think direct I/O and concurrent I/O also deserve a look. The archives suggest that Bruce had some misgivings about dio because of the loss of kernel caching, but almost all databases seem to (carefully) use dio (Solaris, Linux, ?) and cio (AIX) extensively nowadays.
Since these can be enabled on a per-file basis, perf testing them should be simpler too.
Regards,
Nikhils
--
All the world's a stage, and most of us are desperately unrehearsed.
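[A minimal sketch of the per-file direct I/O idea above, assuming Linux, where it is an open() flag (O_DIRECT, a GNU extension); Solaris instead uses directio(fd, DIRECTIO_ON), and AIX cio is a mount/open option. The file name is a placeholder, and buffers and offsets must be suitably aligned.]

    /* Per-file direct I/O on Linux: the read bypasses the kernel page
     * cache entirely. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 8192) != 0)   /* alignment required */
            return 1;

        ssize_t n = pread(fd, buf, 8192, 0);         /* uncached read */
        printf("direct read: %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }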