Thread: Multi CPU Queries - Feedback and/or suggestions wanted!
Hi All,

we would like to start some work on improving the performance of PostgreSQL in a multi-CPU environment. Dano Vojtek is a student at the Faculty of Mathematics and Physics of Charles University in Prague (http://www.mff.cuni.cz) and is going to cover this topic in his master's thesis. He is going to investigate the methods, write down the possibilities, and then implement some of them for PostgreSQL.

We want to come out with a serious proposal for this work after collecting the feedback/opinions and doing a more serious investigation.

Topics that seem to be of interest, most of which were already discussed at the developers meeting in Ottawa, are
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
4.) parallel COPY
5.) parallel pg_dump
6.) using threads for parallel processing

Scaling with an increasing number of CPUs in 1.) and 2.) will hit the I/O bottleneck at some point, and the benefit gained there should be nearly the same as for 3.) - the OS or disk can do a better job scheduling multiple reads from the disk for the same query at the same time.

1.) More merges could be executed on different CPUs. However, one N-way merge on one CPU is probably better than two N/2-way merges on 2 CPUs sharing the work_mem limit between them. This is specific and separate from 2.) or 3.), and if something is implemented here it could probably share just the parallel infrastructure code.

========

2.) Different subtrees (or nodes) of the plan could be executed in parallel on different CPUs, and the results of these subtrees could be requested either synchronously or asynchronously.

========

3.) The simplest possible way is to change the scan nodes so that they send out asynchronous I/O requests for the next blocks before they run out of tuples in the block they are currently going through. The more advanced way would arise just by implementing 2.), which would then lead to different scan nodes being executed on different CPUs at the same time.

========

4.) and 5.) We do not want to focus on these, since there are ongoing projects already.

========

6.) Currently, threads are not used in PostgreSQL (except in some cases on Windows). Generally, using them would bring some problems:

a) different thread implementations on different OSes
b) a crash of the whole process if a problem happens in one thread. Backends are isolated, and a problem in one backend leads to a graceful shutdown of the other backends.
c) synchronization problems

* a) seems to be mostly an implementation concern. Is there any problem with running more threads on any supported OS? Like some scheduling issue where all the threads of the same process end up scheduled on the same CPU? Or something similar?

* b) is fine when using more threads for processing the same query in the same backend - if one crashes, the others could do the graceful shutdown.

* c) does not have to be solved in general, because the work of all the threads will be synchronized and we can predict fairly well which data are being accessed by which thread. Memory allocation has to be made thread safe without hurting performance (is a different memory context for different threads sufficient?). Other common code might need some changes as well. Possibly, the synchronization/critical-section exclusion could be done in the executor and only where needed.

* Using processes instead of threads makes other things more complex:
- sharing objects between processes might need much more coding
- more overhead during execution and synchronization

========

It seems to us that it makes sense to start working on 2) and 3), and we would like to think about using more threads for processing the same query within one backend.

We appreciate feedback, comments and/or suggestions.

Cheers

Julo
Topics that seem to be of interest and most of them were already
discussed at developers meeting in Ottawa are
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
4.) parallel COPY
5.) parallel pg_dump
6.) using threads for parallel processing
2.)
Different subtrees (or nodes) of the plan could be executed in parallel
on different CPUs and the results of this subtrees could be requested
either synchronously or asynchronously.
I don't see why multiple CPUs can't work on the same node of a plan. For instance, consider a node involving a scan with an expensive condition, like UTF-8 string length. If you have four CPUs you can bring to bear, each CPU could take every fourth page, computing the expensive condition for each tuple in that page. The results of the scan can be retired asynchronously to the next node above.
-jwb
There is a problem trying to make Postgres do these things in parallel.
The backend code isn’t thread-safe, so doing a multi-thread implementation requires quite a bit of work.
Using multiple processes has its own problems: The whole way locking works equates one process with one transaction (The proc table is one entry per process). Processes would conflict on locks, deadlocking themselves, as well as many other problems.
It’s all a good idea, but the work is probably far more than you expect.
Async I/O might be easier if you used pthreads, which are mostly portable, but not to all platforms. (Yes, they do work on Windows)
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Jeffrey Baker
Sent: 2008-10-20 22:25
To: Julius Stroffek
Cc: pgsql-hackers@postgresql.org; Dano Vojtek
Subject: Re: [HACKERS] Multi CPU Queries - Feedback and/or suggestions wanted!
On Mon, Oct 20, 2008 at 12:05 PM, Julius Stroffek <Julius.Stroffek@sun.com> wrote:
[...]
Hi Jeffrey,

thank you for the suggestion. Yes, they potentially can; we'll consider this.

Julo

Jeffrey Baker wrote:
> I don't see why multiple CPUs can't work on the same node of a plan. [...]
I can confirm that bringing the Postgres code to a multi-threaded implementation requires quite a bit of ground work. I have been working for a long while with a Postgres 7.* fork that uses pthreads rather than processes. The effort to make all the subsystems thread safe took some time and touched almost every section of the codebase.

I recently spent some time trying to optimize for Chip Multi-Threading systems but focused more on total throughput rather than single query performance. The biggest wins came from changing some coarse-grained locks in the page buffering system to a finer-grained implementation.

I also tried to improve single query performance by splitting index and sequential scans into two threads, one to fault in pages and check tuple visibility and the other for everything else. My success was limited, and it was hard for me to work the proper costing into the query optimizer so that it fired at the right times. One place where multiple threads really helped was index building.

My code is poorly commented and the build system is a mess (I am only building 64-bit SPARC for embedding into another app). However, I am using it in production and the source is available if it's of any help.

http://weaver2.dev.java.net

Myron Scott

On Oct 20, 2008, at 11:28 PM, Chuck McDevitt wrote:
> There is a problem trying to make Postgres do these things in parallel. [...]
On Mon, 2008-10-20 at 21:05 +0200, Julius Stroffek wrote:
> He is going to do some investigation in the methods and
> write down the possibilities and then he is going to implement
> something from that for PostgreSQL.

When will this work be complete? We are days away from completing main work on 8.4, so you won't get much discussion on this for a few months yet. Will it be complete in time for 8.5? Or much earlier even?

Julius, you don't mention what your role is in this. In what sense is Dano's master's thesis a "we" thing?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Julius Stroffek wrote:
> we would like to start some work on improving the performance of
> PostgreSQL in a multi-CPU environment. [...]
> We want to come out with a serious proposal for this work after
> collecting the feedback/opinions and doing the more serious investigation.

Exciting stuff, and clearly a direction we need to explore.

> Topics that seem to be of interest and most of them were already
> discussed at developers meeting in Ottawa are
> 1.) parallel sorts
> 2.) parallel query execution
> 3.) asynchronous I/O

I think the current plan is to use posix_fadvise() to allow parallel I/O, rather than async I/O, because posix_fadvise() will require fewer code changes.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Thu, 23 Oct 2008, Bruce Momjian wrote:
> I think the current plan is to use posix_fadvise() to allow parallel I/O,
> rather than async I/O because posix_fadvise() will require fewer code
> changes.

These are not necessarily mutually exclusive designs. fadvise works fine on Linux, but as far as I know only async I/O works on Solaris. Linux also has an async I/O library, and it's not clear to me yet whether that might work even better than the fadvise approach.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, Oct 23, 2008 at 4:53 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> These are not necessarily mutually exclusive designs. fadvise works fine on
> Linux, but as far as I know only async I/O works on Solaris. Linux also has
> an async I/O library, and it's not clear to me yet whether that might work
> even better than the fadvise approach.

fadvise is a kludge. While it will help, it still makes us completely reliant on the OS. For performance reasons, we should be supporting a multi-block read directly into shared buffers. IIRC, we currently have support for rings in the buffer pool, which we could read directly into. Though an LRU-based buffer manager design would be better in this case.

--
Jonah H. Harris, Senior DBA
myYearbook.com
Jonah H. Harris wrote:
> fadvise is a kludge. While it will help, it still makes us completely
> reliant on the OS. [...]

True, it is a kludge, but if it gives us 95% of the benefit with 10% of the code, it is a win.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
I couldn't get async I/O to work on Linux. That is, it "worked" but performed the same as reading one block at a time. On Solaris the situation is reversed.

In what way is fadvise a kludge?

greg

On 24 Oct 2008, at 01:44 AM, Bruce Momjian <bruce@momjian.us> wrote:
> True, it is a kludge but if it gives us 95% of the benefit with 10% of
> the code, it is a win. [...]
Greg Stark wrote:
> I couldn't get async I/O to work on Linux. That is, it "worked" but
> performed the same as reading one block at a time. On Solaris the
> situation is reversed.
>
> In what way is fadvise a kludge?

I think he is saying AIO gives us more flexibility, but I am unsure we need it.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Thu, Oct 23, 2008 at 8:44 PM, Bruce Momjian <bruce@momjian.us> wrote:
> True, it is a kludge but if it gives us 95% of the benefit with 10% of
> the code, it is a win.

I'd say, optimistically, maybe 30-45% of the benefit over a proper multi-block read using O_DIRECT.

--
Jonah H. Harris, Senior DBA
myYearbook.com
On Thu, Oct 23, 2008 at 10:36 PM, Greg Stark <greg.stark@enterprisedb.com> wrote:
> I couldn't get async I/O to work on Linux. That is, it "worked" but performed
> the same as reading one block at a time. On Solaris the situation is
> reversed.

Hmm, then obviously you did something wrong, because my tests showed it quite well. Pull the source to iozone or fio.

> In what way is fadvise a kludge?

Non-portable, requires more user-to-system CPU, ... need I go on?

--
Jonah H. Harris, Senior DBA
myYearbook.com
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> On Thu, Oct 23, 2008 at 10:36 PM, Greg Stark wrote:
>> In what way is fadvise a kludge?
> non-portable, requires more user-to-system CPU, ... need I go on?

I'd be interested to know which of these proposals you claim *is* portable. The single biggest reason to reject 'em all is that they aren't.

regards, tom lane
On Fri, Oct 24, 2008 at 12:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> non-portable, requires more user-to-system CPU, ... need I go on?
>
> I'd be interested to know which of these proposals you claim *is*
> portable. The single biggest reason to reject 'em all is that
> they aren't.

Yes, that was bad wording on my part. What I meant to say was that it's unpredictable. Different OSes and filesystems handle fadvise differently (or not at all), which makes any claim of a performance gain configuration-dependent.

My preferred method, using O_DIRECT and fetching directly into shared buffers, is not without its own issues and challenges. However, by abstracting the multi-block read interface, we could use more optimal calls depending on the OS.

Having done a bit of research and testing in this area (AIO and buffer management), I don't see any easy solution. fadvise will work on some systems and will likely give some gain on them, but won't work for everyone. The alternative is to abstract prefetching and allow platform-specific code, which we rarely do. While we could build an abstract prefetch interface and simply use fadvise for it now (rather than OS-specific code), I don't see an easy win in any case.

--
Jonah H. Harris, Senior DBA
myYearbook.com
On 24 Oct 2008, at 04:31 AM, "Jonah H. Harris" <jonah.harris@gmail.com> wrote:
> Hmm, then obviously you did something wrong, because my tests showed
> it quite well. Pull the source to iozone or fio.

I posted the source, feel free to point out what I did wrong. It did work on Solaris with and without O_DIRECT, so I didn't think it was a bug in my code.

>> In what way is fadvise a kludge?
>
> non-portable, requires more user-to-system CPU, ... need I go on?

Well, it's just as portable; they're both specified by POSIX. Actually, async I/O is in the real-time extensions, so one could argue it's less portable. Also, before posix_fadvise there was plain old fadvise, so it's portable to older platforms too, whereas async I/O isn't.

posix_fadvise does require two syscalls and two trips to the buffer manager. But that doesn't really make it a kludge if the resulting code is cleaner than the async I/O code would be. To use async I/O we would have to pin all the buffers we're reading, which would be quite a lot of code changes.

I did ask for feedback on precisely this point of whether two trips to the buffer manager was a problem. It would have been nice to get the feedback 6 months ago when I posted it, instead of now, two weeks before feature freeze.
Based on what? I did test this and posted the data. The results I posted showed that posix_fadvise on Linux performed nearly as well as async I/O on Solaris on identical hardware. More importantly, it scaled with the number of drives. A 15-drive array gets about 15x the performance of a 1-drive array if enough read-ahead is done. Plus an extra boost if the input wasn't already sorted, which presumably reflects the better I/O ordering.

--
greg

On 24 Oct 2008, at 04:29 AM, "Jonah H. Harris" <jonah.harris@gmail.com> wrote:
> I'd say, optimistically, maybe 30-45% of the benefit over a proper
> multi-block read using O_DIRECT. [...]
We did discuss this in Ottawa, and I believe your comment was "the hint is in the name", referring to posix_fadvise. In any case, both AIO and posix_fadvise are specified by POSIX, so I don't see either as a problem on that front.

I don't think we can ignore any longer that we effectively can't use RAID arrays with Postgres. If you have many concurrent queries or restrict yourself to sequential scans you're OK, but if you're doing data warehousing you're going to be pretty disappointed to see your shiny RAID array performing like a single drive.

greg

On 24 Oct 2008, at 05:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd be interested to know which of these proposals you claim *is*
> portable. The single biggest reason to reject 'em all is that
> they aren't. [...]
Jonah H. Harris wrote:
> fadvise is a kludge.

I don't think it's a kludge at all. posix_fadvise() is a pretty nice and clean interface to hint the kernel what pages you're going to access in the near future. I can't immediately come up with a cleaner interface to do that. Compared to async I/O, it's a helluva lot simpler to add a few posix_fadvise() calls to an application than to switch to a completely different paradigm. And while posix_fadvise() is just a hint, allowing the OS to prioritize accordingly, all async I/O requests look the same.

> While it will help, it still makes us completely
> reliant on the OS.

That's not a bad thing in my opinion. The OS knows the I/O hardware, disk layout, utilization, and so forth, and is in a much better position to do I/O scheduling than a user process. The only advantage a user process has is that it knows better what pages it's going to need, and posix_fadvise() is a good interface to let the user process tell the kernel that.

> IIRC, we currently have support for rings in the buffer pool, which we could read
> directly into.

The rings won't help you a bit. They are just a different way to choose victim buffers.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Jonah H. Harris wrote:
> I'd say, optimistically, maybe 30-45% of the benefit over a proper
> multi-block read using O_DIRECT.

Let's try to focus. We're not talking about using O_DIRECT, we're talking about using asynchronous I/O or posix_fadvise(). And without more details on what you mean by benefit, and under what circumstances, numbers like that are just unmeasurable handwaving.

In terms of getting the RAID array busy, in Greg's tests posix_fadvise() on Linux worked just as well as async I/O works on Solaris. So it doesn't seem like there's any inherent performance advantage in the async I/O interface over posix_fadvise() + read(). There are differences between different OS implementations of the interfaces, but we're developing software for the future, and for a wide range of platforms, and I'm sure operating systems will develop as well. The decision should not be made on what is the fastest interface on a given operating system in 2008.

Async I/O might have a small potential edge in CPU usage, because fewer system calls are needed. However, let me remind you all that we're talking about how to utilize a RAID array to do physical, random I/O as fast as possible. IOW, the bottleneck is I/O, by definition. The CPU efficiency of the kernel interface used to initiate the I/O is insignificant until we reach a large enough random read throughput to saturate the CPU, and even then there are probably more significant CPU savings to be made elsewhere.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-10-24 at 00:52 -0400, Jonah H. Harris wrote: > While we could build an > abstract prefetch interface and simply use fadvise for it now (rather > than OS-specific code), I don't see an easy win in any case. When building an abstract interface, always use at least two implementations (I guess that would be fadvise on linux and AIO on solaris in this case). You are much more likely to get the interface right this way. -------------- Hannu
On Fri, Oct 24, 2008 at 7:59 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > On Fri, 2008-10-24 at 00:52 -0400, Jonah H. Harris wrote: >> While we could build an >> abstract prefetch interface and simply use fadvise for it now (rather >> than OS-specific code), I don't see an easy win in any case. > > When building an abstract interface, always use at least two > implementations (I guess that would be fadvise on linux and AIO on > solaris in this case). You are much more likely to get the interface > right this way. I agree, I just wasn't sure as to whether Greg's patch supported both methods. -- Jonah H. Harris, Senior DBA myYearbook.com
Jonah H. Harris wrote:
> I agree, I just wasn't sure as to whether Greg's patch supported both methods.

It does not, and probably will not for the near future; we can only hope Solaris supports posix_fadvise() at some point.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
I thought about how to support both and ran into problems that would make the resulting solutions quite complex.

In the libaio view of the world you initiate I/O and either get a callback or call another syscall to test whether it's complete. Either approach has problems for Postgres. If the process that initiated the I/O is in the middle of a long query, it might take a long time, or even never, get back to complete the I/O. The callbacks use threads...

And polling for completion has the problem that another process could be waiting on the I/O and can't issue a read as long as the first process has the buffer locked and I/O in progress. I think AIO makes a lot more sense if you're using threads, so you can start a thread to wait for the I/O to complete.

Actually, I think it might be doable with a lot of work, but I'm worried that it would be a lot of extra complexity even when you're not using it. The current patch doesn't change anything when you're not using it and is actually quite simple.

greg

On 24 Oct 2008, at 03:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
> It does not, and probably will not for the near future; we can only
> hope Solaris supports posix_fadvise() at some point. [...]
Also keep in mind that Solaris is open source these days. If someone wants it, they could always go ahead and add the feature...

greg

On 24 Oct 2008, at 03:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
> It does not, and probably will not for the near future; we can only
> hope Solaris supports posix_fadvise() at some point. [...]
* Greg Stark <greg.stark@enterprisedb.com> [081024 10:48]:
> I thought about how to support both and ran into problems that would
> make the resulting solutions quite complex.
>
> In the libaio view of the world you initiate I/O and either get a
> callback or call another syscall to test whether it's complete. Either
> approach has problems for Postgres. If the process that initiated the
> I/O is in the middle of a long query it might take a long time, or
> never, get back to complete the I/O. The callbacks use threads...
>
> And polling for completion has the problem that another process could
> be waiting on the I/O and can't issue a read as long as the first
> process has the buffer locked and I/O in progress. I think AIO makes a
> lot more sense if you're using threads, so you can start a thread to
> wait for the I/O to complete.
>
> Actually I think it might be doable with a lot of work, but I'm worried
> that it would be a lot of extra complexity even when you're not using
> it. The current patch doesn't change anything when you're not using it
> and is actually quite simple.

In the Solaris async I/O, are you bound to direct I/O? Does the OS page cache still get primed by async reads? If so, how about starting async I/O into a "throwaway" local buffer, treating async I/O the same way as fadvise: a "pre-load the OS page cache so the real read is quick".

Sure, I understand it's not the "perfect model", but I don't see PostgreSQL being refactored enough to have a pure async model happening any time in the near future...

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Hi Simon,

>> He is going to do some investigation in the methods and write down
>> the possibilities and then he is going to implement something from
>> that for PostgreSQL.
>
> When will this work be complete? We are days away from completing main
> work on 8.4, so you won't get much discussion on this for a few months
> yet. Will it be complete in time for 8.5? Or much earlier even?

The first guess is that the work will be done for 8.6. Dano is supposed to finish the work and defend his thesis in something a bit more than a year.

> Julius, you don't mention what your role is in this. In what sense is
> Dano's master's thesis a "we" thing?

I am Dano's mentor and we are in close contact with Zdenek as well. We would like the project to become a "we" thing as another reason to work on the project. It seems to be better to research some ideas at the beginning and discuss the stuff during development than to just individually write some piece of code which could be published afterwards. Especially when this area seems to be of interest to more people.

Cheers

Julo
>Hi Simon,
>
>He is going to do some investigation in the methods and write down the
>possibilities and then he is going to implement something from that for
>PostgreSQL.
>
>When will this work be complete? We are days away from completing main
>work on 8.4, so you won't get much discussion on this for a few months
>yet. Will it be complete in time for 8.5? Or much earlier even?
>
>The first guess is that the work will be done for 8.6. Dano is supposed
>to finish the work and defend his thesis in something a bit more than a
>year.
>
>Julius, you don't mention what your role is in this. In what sense is
>Dano's master's thesis a "we" thing?
>
>I am Dano's mentor and we are in close contact with Zdenek as well. We
>would like the project to become a "we" thing as another reason to work
>on the project. It seems to be better to research some ideas at the
>beginning and discuss the stuff during development than to just
>individually write some piece of code which could be published
>afterwards. Especially when this area seems to be of interest to more
>people.

Threads are where future performance is going to come from:

General purpose ->
http://www.setup32.com/hardware/cpuchipset/32core-processors-intel-reache.php

GPU ->
http://wwwx.cs.unc.edu/~lastra/Research/GPU_performance.html
http://www.cs.unc.edu/~geom/GPUSORT/results.html

Database engines that want to exploit the ultimate in performance will utilize multiple threads of execution. True, the same thing can be realized with multiple processes, but a process is more expensive than a thread.
Bruce Momjian wrote:
> Greg Stark wrote:
>> I couldn't get async I/O to work on Linux. That is, it "worked" but
>> performed the same as reading one block at a time. On Solaris the
>> situation is reversed.
>>
>> In what way is fadvise a kludge?
>
> I think he is saying AIO gives us more flexibility, but I am unsure we
> need it.

absolutely. posix_fadvise is easy to implement and i would assume that it takes away a lot of "guessing" on the OS internals side. the database usually knows that it is going to read a lot of data in a certain way, and it cannot be a bad idea to give the kernel a hint here. especially synchronized seq scans and so on are real winners here, as you stop confusing the kernel with XX concurrent readers on the same file. this can also be an issue with some controller firmwares and so on.

many thanks,

hans