Thread: Re: Extended Prefetching using Asynchronous IO - proposal and patch
The patch is attached.
It is based on a clone of today's 9.4dev source.
I have noticed that this source is
(not surprisingly) quite a moving target at present,
meaning that this patch becomes stale quite quickly.
So although this copy is fine for reviewing,
it will quite probably soon no longer apply cleanly
to the current source tree.
As mentioned before, if anyone wishes to try this feature out
on 9.3.4, I will be making a patch for that soon
which I can supply on request.
John Lumby
Attachment
On 05/28/2014 11:52 PM, John Lumby wrote:
> The patch is attached.
> It is based on a clone of today's 9.4dev source.
> I have noticed that this source is (not surprisingly) quite a moving target at present, meaning that this patch becomes stale quite quickly.
> So although this copy is fine for reviewing, it will quite probably soon no longer apply cleanly to the current source tree.
>
> As mentioned before, if anyone wishes to try this feature out on 9.3.4, I will be making a patch for that soon which I can supply on request.

Wow, that's a huge patch. I took a very brief look, focusing on the basic design and ignoring the style & other minor things for now:

The patch seems to assume that you can put the aiocb struct in shared memory, initiate an asynchronous I/O request from one process, and wait for its completion from another process. I'm pretty surprised if that works on any platform.

How portable is POSIX aio nowadays? Googling around, it still seems that on Linux, it's implemented using threads. Does the thread-emulation implementation cause problems with the rest of the backend, which assumes that there is only a single thread? In any case, I think we'll want to encapsulate the AIO implementation behind some kind of an API, to allow other implementations to co-exist.

Benchmarks?

- Heikki
Thanks for looking at it!
> Date: Thu, 29 May 2014 00:19:33 +0300
> From: hlinnakangas@vmware.com
> To: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
> CC: klaussfreire@gmail.com
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>
> On 05/28/2014 11:52 PM, John Lumby wrote:
> >
>
> The patch seems to assume that you can put the aiocb struct in shared
> memory, initiate an asynchronous I/O request from one process, and wait
> for its completion from another process. I'm pretty surprised if that
> works on any platform.
It works on linux. Actually this ability allows the asyncio implementation to
reduce complexity in one respect (yes, I know it looks complex enough):
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the waiting.
Specifically, there is no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO().
This also plays very nicely with syncscan, where the cohorts run close together.
If anyone could advise whether this also works on MacOS/BSD or Windows,
that would be good, as I can't verify it myself.
>
> How portable is POSIX aio nowadays? Googling around, it still seems that
> on Linux, it's implemented using threads. Does the thread-emulation
> implementation cause problems with the rest of the backend, which
> assumes that there is only a single thread? In any case, I think we'll
Good question, I am not aware of any dependency on a backend having only
a single thread. Can you please point me to such dependencies?
> want to encapsulate the AIO implementation behind some kind of an API,
> to allow other implementations to co-exist.
It is already - it follows the smgr (storage manager) -> md (magnetic disk) -> fd (file descriptor)
layering used for the existing filesystem ops, including posix_fadvise.
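To sketch what I mean by that layering (the declarations below are illustrative placeholders written for this message, not the patch's actual names or signatures, apart from FileCompleteaio which does exist in the patch's fd.c):

    /*
     * Illustrative sketch of the layering only -- placeholder names and
     * signatures.  The async read follows the same path as the existing
     * smgrprefetch() -> mdprefetch() -> FilePrefetch() chain used for
     * posix_fadvise.
     */

    /* smgr.c: storage-manager switch, alongside smgrread()/smgrprefetch() */
    extern void smgrstartaioread(SMgrRelation reln, ForkNumber forknum,
                                 BlockNumber blocknum, char *buffer);

    /* md.c: map (forknum, blocknum) to a segment file and byte offset */
    extern void mdstartaioread(SMgrRelation reln, ForkNumber forknum,
                               BlockNumber blocknum, char *buffer);

    /* fd.c: issue aio_read() against the virtual file descriptor ... */
    extern int  FileStartaioread(File file, char *buffer, int amount, off_t offset);

    /* ... and later check / wait for its completion (signature is a guess) */
    extern int  FileCompleteaio(File file);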
>
> Benchmarks?
I will see if I can package mine up somehow.
Would be great if anyone else would like to benchmark it on theirs ...
> - Heikki
On 05/29/2014 04:12 PM, John Lumby wrote:
> Thanks for looking at it!
>
>> Date: Thu, 29 May 2014 00:19:33 +0300
>> From: hlinnakangas@vmware.com
>> To: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>> CC: klaussfreire@gmail.com
>> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>>
>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>
>> The patch seems to assume that you can put the aiocb struct in shared memory, initiate an asynchronous I/O request from one process, and wait for its completion from another process. I'm pretty surprised if that works on any platform.
>
> It works on linux. Actually this ability allows the asyncio implementation to reduce complexity in one respect (yes, I know it looks complex enough): it makes waiting for completion of an in-progress IO simpler than for the existing synchronous IO case, since librt takes care of the waiting.
> Specifically, there is no need for extra wait-for-io control blocks such as in bufmgr's WaitIO().

[checks]. No, it doesn't work. See attached test program.

It kinda seems to work sometimes, because of the way it's implemented in glibc. The aiocb struct has a field for the result value and errno, and when the I/O is finished, the worker thread fills them in. aio_error() and aio_return() just return the values of those fields, so calling aio_error() or aio_return() does in fact happen to work from a different process. aio_suspend(), however, is implemented by sleeping on a process-local mutex, which does not work from a different process.

Even if it worked on Linux today, it would be a bad idea to rely on it from a portability point of view. No, the only sane way to make this work is that the process that initiates an I/O request is responsible for completing it. If another process needs to wait for an async I/O to complete, we must use some other means to do the waiting. Like the io_in_progress_lock that we already have, for the same purpose.

- Heikki
Attachment
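The attached test program is not reproduced in the archive; a stand-alone sketch of the failure mode described above, written for this discussion and assuming Linux/glibc (link with -lrt), might look like the following. The expectation is that the child sleeps through its whole timeout and reports EAGAIN even though the parent's read completes after about one second -- yet aio_error() afterwards shows the request as done:

    #include <aio.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* aiocb and buffer live in memory shared across fork() */
        struct aiocb *cb = mmap(NULL, sizeof(struct aiocb) + 64,
                                PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        const struct aiocb *list[1] = {cb};
        int         pfd[2];

        pipe(pfd);                  /* the read stays in progress until we write */
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = pfd[0];
        cb->aio_buf = (char *) (cb + 1);
        cb->aio_nbytes = 64;

        if (aio_read(cb) != 0)      /* parent initiates the I/O */
        {
            perror("aio_read");
            return 1;
        }

        if (fork() == 0)
        {
            /* child: wait up to 3 s for completion of the parent's aio */
            struct timespec to = {3, 0};

            if (aio_suspend(list, 1, &to) != 0)
                printf("child: aio_suspend gave up: %s\n", strerror(errno));
            /* ... although the shared aiocb itself says the read finished */
            printf("child: aio_error now reports %d\n", aio_error(cb));
            _exit(0);
        }

        sleep(1);
        write(pfd[1], "some data", 9);  /* completes the read at ~1 s */
        wait(NULL);
        return 0;
    }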
On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 05/29/2014 04:12 PM, John Lumby wrote:
>>
>> Thanks for looking at it!
>>
>>> Date: Thu, 29 May 2014 00:19:33 +0300
>>> From: hlinnakangas@vmware.com
>>> To: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>>> CC: klaussfreire@gmail.com
>>> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>>>
>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>
>>> The patch seems to assume that you can put the aiocb struct in shared memory, initiate an asynchronous I/O request from one process, and wait for its completion from another process. I'm pretty surprised if that works on any platform.
>>
>> It works on linux. Actually this ability allows the asyncio implementation to reduce complexity in one respect (yes, I know it looks complex enough): it makes waiting for completion of an in-progress IO simpler than for the existing synchronous IO case, since librt takes care of the waiting.
>> Specifically, there is no need for extra wait-for-io control blocks such as in bufmgr's WaitIO().
>
> [checks]. No, it doesn't work. See attached test program.
>
> It kinda seems to work sometimes, because of the way it's implemented in glibc. The aiocb struct has a field for the result value and errno, and when the I/O is finished, the worker thread fills them in. aio_error() and aio_return() just return the values of those fields, so calling aio_error() or aio_return() does in fact happen to work from a different process. aio_suspend(), however, is implemented by sleeping on a process-local mutex, which does not work from a different process.
>
> Even if it worked on Linux today, it would be a bad idea to rely on it from a portability point of view. No, the only sane way to make this work is that the process that initiates an I/O request is responsible for completing it. If another process needs to wait for an async I/O to complete, we must use some other means to do the waiting. Like the io_in_progress_lock that we already have, for the same purpose.

But calls to it are timeouted by 10us, effectively turning the thing into polling mode.

What is odd, now that I read it carefully, is how 256 waits of 10us amount to anything. That's just 2.5ms, not enough IIUC for any normal I/O to complete.
On 05/29/2014 11:34 PM, Claudio Freire wrote:
> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>
>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>
>>>> The patch seems to assume that you can put the aiocb struct in shared memory, initiate an asynchronous I/O request from one process, and wait for its completion from another process. I'm pretty surprised if that works on any platform.
>>>
>>> It works on linux. Actually this ability allows the asyncio implementation to reduce complexity in one respect (yes, I know it looks complex enough): it makes waiting for completion of an in-progress IO simpler than for the existing synchronous IO case, since librt takes care of the waiting.
>>> Specifically, there is no need for extra wait-for-io control blocks such as in bufmgr's WaitIO().
>>
>> [checks]. No, it doesn't work. See attached test program.
>>
>> It kinda seems to work sometimes, because of the way it's implemented in glibc. The aiocb struct has a field for the result value and errno, and when the I/O is finished, the worker thread fills them in. aio_error() and aio_return() just return the values of those fields, so calling aio_error() or aio_return() does in fact happen to work from a different process. aio_suspend(), however, is implemented by sleeping on a process-local mutex, which does not work from a different process.
>>
>> Even if it worked on Linux today, it would be a bad idea to rely on it from a portability point of view. No, the only sane way to make this work is that the process that initiates an I/O request is responsible for completing it. If another process needs to wait for an async I/O to complete, we must use some other means to do the waiting. Like the io_in_progress_lock that we already have, for the same purpose.
>
> But calls to it are timeouted by 10us, effectively turning the thing into polling mode.

We don't want polling... And even if we did, calling aio_suspend() in a way that's known to be broken, in a loop, is a pretty crappy way of polling.

> What is odd, now that I read it carefully, is how 256 waits of 10us amount to anything. That's just 2.5ms, not enough IIUC for any normal I/O to complete.

Wild guess: the buffer you're reading is already in OS cache.

- Heikki
On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 05/29/2014 11:34 PM, Claudio Freire wrote:
>> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>>
>>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>>
>>>>> The patch seems to assume that you can put the aiocb struct in shared memory, initiate an asynchronous I/O request from one process, and wait for its completion from another process. I'm pretty surprised if that works on any platform.
>>>>
>>>> It works on linux. Actually this ability allows the asyncio implementation to reduce complexity in one respect (yes, I know it looks complex enough): it makes waiting for completion of an in-progress IO simpler than for the existing synchronous IO case, since librt takes care of the waiting.
>>>> Specifically, there is no need for extra wait-for-io control blocks such as in bufmgr's WaitIO().
>>>
>>> [checks]. No, it doesn't work. See attached test program.
>>>
>>> It kinda seems to work sometimes, because of the way it's implemented in glibc. The aiocb struct has a field for the result value and errno, and when the I/O is finished, the worker thread fills them in. aio_error() and aio_return() just return the values of those fields, so calling aio_error() or aio_return() does in fact happen to work from a different process. aio_suspend(), however, is implemented by sleeping on a process-local mutex, which does not work from a different process.
>>>
>>> Even if it worked on Linux today, it would be a bad idea to rely on it from a portability point of view. No, the only sane way to make this work is that the process that initiates an I/O request is responsible for completing it. If another process needs to wait for an async I/O to complete, we must use some other means to do the waiting. Like the io_in_progress_lock that we already have, for the same purpose.
>>
>> But calls to it are timeouted by 10us, effectively turning the thing into polling mode.
>
> We don't want polling... And even if we did, calling aio_suspend() in a way that's known to be broken, in a loop, is a pretty crappy way of polling.

Agreed. Just saying that that's why it works.

The PID of the process responsible for the aio should be there somewhere, and other processes should explicitly synchronize (or initiate their own I/O and let the OS merge them - which also works).

>> What is odd, now that I read it carefully, is how 256 waits of 10us amount to anything. That's just 2.5ms, not enough IIUC for any normal I/O to complete.
>
> Wild guess: the buffer you're reading is already in OS cache.

My wild guess as well.
On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 05/29/2014 11:34 PM, Claudio Freire wrote:
>> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>>
>>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>>
>>>>> The patch seems to assume that you can put the aiocb struct in shared memory, initiate an asynchronous I/O request from one process, and wait for its completion from another process. I'm pretty surprised if that works on any platform.
>>>>
>>>> It works on linux. Actually this ability allows the asyncio implementation to reduce complexity in one respect (yes, I know it looks complex enough): it makes waiting for completion of an in-progress IO simpler than for the existing synchronous IO case, since librt takes care of the waiting.
>>>> Specifically, there is no need for extra wait-for-io control blocks such as in bufmgr's WaitIO().
>>>
>>> [checks]. No, it doesn't work. See attached test program.
>>>
>>> It kinda seems to work sometimes, because of the way it's implemented in glibc. The aiocb struct has a field for the result value and errno, and when the I/O is finished, the worker thread fills them in. aio_error() and aio_return() just return the values of those fields, so calling aio_error() or aio_return() does in fact happen to work from a different process. aio_suspend(), however, is implemented by sleeping on a process-local mutex, which does not work from a different process.
>>>
>>> Even if it worked on Linux today, it would be a bad idea to rely on it from a portability point of view. No, the only sane way to make this work is that the process that initiates an I/O request is responsible for completing it. If another process needs to wait for an async I/O to complete, we must use some other means to do the waiting. Like the io_in_progress_lock that we already have, for the same purpose.
>>
>> But calls to it are timeouted by 10us, effectively turning the thing into polling mode.
>
> We don't want polling... And even if we did, calling aio_suspend() in a way that's known to be broken, in a loop, is a pretty crappy way of polling.

Didn't fix that, but the attached patch does fix regression tests when scanning over index types other than btree (was invoking elog when the index am didn't have ampeeknexttuple)
Attachment
Claudio Freire <klaussfreire@gmail.com> writes:
> Didn't fix that, but the attached patch does fix regression tests when
> scanning over index types other than btree (was invoking elog when the
> index am didn't have ampeeknexttuple)

"ampeeknexttuple"? That's a bit scary. It would certainly be unsafe for non-MVCC snapshots (read about vacuum vs indexscan interlocks in nbtree/README).

			regards, tom lane
On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> Didn't fix that, but the attached patch does fix regression tests when
>> scanning over index types other than btree (was invoking elog when the
>> index am didn't have ampeeknexttuple)
>
> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
> nbtree/README).

It's not really the tuple, just the tid.
On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Claudio Freire <klaussfreire@gmail.com> writes:
>>> Didn't fix that, but the attached patch does fix regression tests when
>>> scanning over index types other than btree (was invoking elog when the
>>> index am didn't have ampeeknexttuple)
>>
>> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
>> nbtree/README).
>
> It's not really the tuple, just the tid.

And, furthermore, it's used only to do prefetching, so even if the tid were invalid when the tuple needs to be accessed, it wouldn't matter, because the indexam wouldn't use the result of ampeeknexttuple to do anything at that time.

Though your comment does illustrate the need to document that on ampeeknexttuple, for future users.
> Date: Thu, 29 May 2014 18:00:28 -0300
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: klaussfreire@gmail.com
> To: hlinnakangas@vmware.com
> CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>
> On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > On 05/29/2014 11:34 PM, Claudio Freire wrote:
> >>
> >> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
> >> <hlinnakangas@vmware.com> wrote:
> >>>
> >>> On 05/29/2014 04:12 PM, John Lumby wrote:
> >>>>
> >>>>
> >>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
> >>>>>
> >>>>> The patch seems to assume that you can put the aiocb struct in shared
> >>>>> memory, initiate an asynchronous I/O request from one process, and wait
> >>>>> for its completion from another process. I'm pretty surprised if that
> >>>>> works on any platform.
> >>>>
> >>>>
> >>>> It works on linux. Actually this ability allows the asyncio
> >>>> implementation to
> >>>> reduce complexity in one respect (yes I know it looks complex enough) :
> >>>> it makes waiting for completion of an in-progress IO simpler than for
> >>>> the existing synchronous IO case,. since librt takes care of the
> >>>> waiting.
> >>>> specifically, no need for extra wait-for-io control blocks
> >>>> such as in bufmgr's WaitIO()
> >>>
> >>>
> >>> [checks]. No, it doesn't work. See attached test program.
Thanks for checking and thanks for coming up with that test program.
However, yes, it really does work -- always (on linux).
Your test program is doing things in the wrong order -
it calls aio_suspend *before* aio_error.
The rule is: call aio_suspend *after* aio_error,
and *only* if aio_error returns EINPROGRESS.
See the code changes to the fd.c function FileCompleteaio()
to see how we have done it. And I am attaching a corrected version
of your test program, which runs just fine.
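In outline, the calling pattern works like this (a simplified, self-contained sketch written for this message -- not the actual FileCompleteaio() code):

    #include <aio.h>
    #include <errno.h>
    #include <time.h>

    /*
     * Sketch only: check aio_error() first, and call aio_suspend() with a
     * short timeout only while the request is still EINPROGRESS.
     */
    static int
    wait_for_aio(struct aiocb *cb)
    {
        const struct aiocb *list[1] = {cb};
        struct timespec timeout = {0, 10000};   /* 10 us per wait */
        int         tries;

        for (tries = 0; tries < 256; tries++)
        {
            int         err = aio_error(cb);

            if (err != EINPROGRESS)
                return err;         /* 0 on success, else the errno of the I/O */

            /* still in progress: wait a little, then re-check */
            (void) aio_suspend(list, 1, &timeout);
        }
        return EINPROGRESS;         /* caller falls back to a synchronous read */
    }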
> >>>
> >>> It kinda seems to work sometimes, because of the way it's implemented in
> >>> glibc. The aiocb struct has a field for the result value and errno, and
> >>> when
> >>> the I/O is finished, the worker thread fills them in. aio_error() and
> >>> aio_return() just return the values of those fields, so calling
> >>> aio_error()
> >>> or aio_return() do in fact happen to work from a different process.
> >>> aio_suspend(), however, is implemented by sleeping on a process-local
> >>> mutex,
> >>> which does not work from a different process.
> >>>
> >>> Even if it worked on Linux today, it would be a bad idea to rely on it
> >>> from
> >>> a portability point of view. No, the only sane way to make this work is
> >>> that
> >>> the process that initiates an I/O request is responsible for completing
> >>> it.
> >>> If another process needs to wait for an async I/O to complete, we must
> >>> use
> >>> some other means to do the waiting. Like the io_in_progress_lock that we
> >>> already have, for the same purpose.
> >>
> >>
> >> But calls to it are timeouted by 10us, effectively turning the thing
> >> into polling mode.
> >
> >
> > We don't want polling... And even if we did, calling aio_suspend() in a way
> > that's known to be broken, in a loop, is a pretty crappy way of polling.
Well, as mentioned earlier, it is not broken. Whether it is efficient, I am not sure.
I have looked at the mutex in aio_suspend that you mentioned, and I am not
quite convinced that, if the caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can verify that.
Where are you (Claudio) seeing 10us?
>
>
> Didn't fix that, but the attached patch does fix regression tests when
> scanning over index types other than btree (was invoking elog when the
> index am didn't have ampeeknexttuple)
Attachment
Claudio Freire <klaussfreire@gmail.com> writes:
> On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
>>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
>>> nbtree/README).

>> It's not really the tuple, just the tid

> And, furthermore, it's used only to do prefetching, so even if the tid
> was invalid when the tuple needs to be accessed, it wouldn't matter,
> because the indexam wouldn't use the result of ampeeknexttuple to do
> anything at that time.

Nonetheless, getting the next tid out of the index may involve stepping to the next index page, at which point you've lost your interlock guaranteeing that the *previous* tid will still mean something by the time you arrive at its heap page. I presume that the ampeeknexttuple call is issued before trying to visit the heap (otherwise you're not actually getting much I/O overlap), so I think there's a real risk here.

Having said that, it's probably OK as long as this mode is only invoked for user queries (with MVCC snapshots) and not for system indexscans.

			regards, tom lane
> From: tgl@sss.pgh.pa.us
> To: klaussfreire@gmail.com
> CC: hlinnakangas@vmware.com; johnlumby@hotmail.com; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> Date: Thu, 29 May 2014 17:56:57 -0400
>
> Claudio Freire <klaussfreire@gmail.com> writes:
> > On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> >> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
> >>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
> >>> nbtree/README).
>
> >> It's not really the tuple, just the tid
>
> > And, furthermore, it's used only to do prefetching, so even if the tid
> > was invalid when the tuple needs to be accessed, it wouldn't matter,
> > because the indexam wouldn't use the result of ampeeknexttuple to do
> > anything at that time.
>
> Nonetheless, getting the next tid out of the index may involve stepping
> to the next index page, at which point you've lost your interlock
I think we are ok, as peeknexttuple (yes, bad name, sorry, can change it ...)
never advances to another page (a rough sketch of the idea is at the end of this mail):
 *	btpeeknexttuple() -- peek at the next tuple different from any blocknum in pfch_list
 *		without reading a new index page
 *		and without causing any side-effects such as altering values in control blocks
 *		if found, store blocknum in next element of pfch_list
> guaranteeing that the *previous* tid will still mean something by the time
> you arrive at its heap page. I presume that the ampeeknexttuple call is
> issued before trying to visit the heap (otherwise you're not actually
> getting much I/O overlap), so I think there's a real risk here.
>
> Having said that, it's probably OK as long as this mode is only invoked
> for user queries (with MVCC snapshots) and not for system indexscans.
>
> regards, tom lane
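As a rough illustration of the btpeeknexttuple() idea described above (a sketch written for this message, not the code in the patch; the PostgreSQL types are real, but the function, the pfch_list layout and the fixed size are invented here):

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/itemptr.h"

    /* Invented stand-in for the patch's pfch_list: blocks already requested */
    typedef struct PrefetchList
    {
        BlockNumber blocks[32];
        int         nblocks;
    } PrefetchList;

    /*
     * Peek at the heap block of the next TID already present in the current
     * index page's in-memory item array, skipping blocks we have already
     * prefetched.  No new index page is read and no scan state is changed,
     * so there is nothing to undo if the block is never actually visited.
     */
    static bool
    peek_next_block(ItemPointerData *items, int nitems, int next_item,
                    PrefetchList *pfch)
    {
        int         i;

        for (i = next_item; i < nitems; i++)
        {
            BlockNumber blk = ItemPointerGetBlockNumber(&items[i]);
            bool        seen = false;
            int         j;

            for (j = 0; j < pfch->nblocks; j++)
            {
                if (pfch->blocks[j] == blk)
                {
                    seen = true;
                    break;
                }
            }

            if (!seen)
            {
                if (pfch->nblocks < 32)
                    pfch->blocks[pfch->nblocks++] = blk;    /* remember it */
                return true;        /* caller can issue a prefetch for blk */
            }
        }
        return false;               /* nothing new on this index page */
    }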
On Thu, May 29, 2014 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
>>>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
>>>> nbtree/README).
>
>>> It's not really the tuple, just the tid
>
>> And, furthermore, it's used only to do prefetching, so even if the tid
>> was invalid when the tuple needs to be accessed, it wouldn't matter,
>> because the indexam wouldn't use the result of ampeeknexttuple to do
>> anything at that time.
>
> Nonetheless, getting the next tid out of the index may involve stepping
> to the next index page, at which point you've lost your interlock
> guaranteeing that the *previous* tid will still mean something by the time

No, no... that's exactly why a new regproc is needed: for prefetching, we need to get the next tid that satisfies some conditions *without* walking the index. In nbtree, this only looks through the tid array to find a suitable tid, or just returns false if the array is exhausted.

> Having said that, it's probably OK as long as this mode is only invoked
> for user queries (with MVCC snapshots) and not for system indexscans.

I think system index scans will also invoke this. There's no rule excluding that possibility.
On Thu, May 29, 2014 at 7:11 PM, John Lumby <johnlumby@hotmail.com> wrote:
>> Nonetheless, getting the next tid out of the index may involve stepping
>> to the next index page, at which point you've lost your interlock
>
> I think we are ok, as peeknexttuple (yes, bad name, sorry, can change it ...)
> never advances to another page:
>
>  * btpeeknexttuple() -- peek at the next tuple different from any blocknum in pfch_list
>  *	without reading a new index page
>  *	and without causing any side-effects such as altering values in control blocks
>  *	if found, store blocknum in next element of pfch_list

Yeah, I was getting to that conclusion myself too.

We could call it amprefetchnextheap, since it does just prefetch, and is good for nothing *but* prefetch.
> Date: Thu, 29 May 2014 18:00:28 -0300
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: klaussfreire@gmail.com
> To: hlinnakangas@vmware.com
> CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>
> >>>
> >>> Even if it worked on Linux today, it would be a bad idea to rely on it
> >>> from
> >>> a portability point of view. No, the only sane way to make this work is
> >>> that
> >>> the process that initiates an I/O request is responsible for completing
> >>> it.
I meant to add - it is really a significant benefit that a backend
can wait on the aio of a different backend's original prefetching aio_read.
Remember that we check completion only when the backend decides it really
wants the block in a buffer, i.e. ReadBuffer and friends,
which might be a very long time after it had issued the prefetch request,
or even never (see below). We don't want other backends which want that
block to have to wait for the originator to get around to reading it.
*Especially* since the originator may *never* read it if it quits its scan early,
leaving prefetched but unread blocks behind. (Which is also taken
care of in the patch.)
> >>> If another process needs to wait for an async I/O to complete, we must
> >>> use
> >>> some other means to do the waiting. Like the io_in_progress_lock that we
> >>> already have, for the same purpose.
Hi,

On 2014-05-29 17:53:51 -0400, John Lumby wrote:
> to see how we have done it. And I am attaching a corrected version
> of your test program, which runs just fine.

It's perfectly fine to not be up to the coding style at this point, but trying to adhere to it to some degree will make code review later less painful...
* comments with **
* line length
* tabs vs spaces
* ...

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, May 29, 2014 at 6:53 PM, John Lumby <johnlumby@hotmail.com> wrote:
> Well, as mentioned earlier, it is not broken. Whether it is efficient, I am not sure.
> I have looked at the mutex in aio_suspend that you mentioned, and I am not
> quite convinced that, if the caller is not the original aio_read process,
> it renders the suspend() into an instant timeout. I will see if I can verify that.
> Where are you (Claudio) seeing 10us?

fd.c, in FileCompleteaio, sets the timeout to:

    my_timeout.tv_sec = 0;
    my_timeout.tv_nsec = 10000;

Which is 10k ns, which is 10 us. It loops 256 times at most, so it's polling 256 times with a 10 us timeout. Sounds wasteful.

I'd:

1) If it's the same process, wait for the full timeout (no looping). If you have to loop (EAGAIN or EINTR - which I just noticed you don't check for), that's ok.

2) If it's not, just fall through, don't wait, issue the I/O. The kernel will merge the requests.
On Thu, May 29, 2014 at 7:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> 1) If it's the same process, wait for the full timeout (no looping).
> If you have to loop (EAGAIN or EINTR - which I just noticed you don't
> check for), that's ok.

Sorry, meant to say just looping on EINTR.

About the style guidelines, no, I just copy the style of surrounding code usually.
On 05/30/2014 12:53 AM, John Lumby wrote:
>> Date: Thu, 29 May 2014 18:00:28 -0300
>> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>> From: klaussfreire@gmail.com
>> To: hlinnakangas@vmware.com
>> CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>>
>> On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> On 05/29/2014 11:34 PM, Claudio Freire wrote:
>>>> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>>>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>>>>
>>>>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>>>>
>>>>>>> The patch seems to assume that you can put the aiocb struct in shared memory, initiate an asynchronous I/O request from one process, and wait for its completion from another process. I'm pretty surprised if that works on any platform.
>>>>>>
>>>>>> It works on linux. Actually this ability allows the asyncio implementation to reduce complexity in one respect (yes, I know it looks complex enough): it makes waiting for completion of an in-progress IO simpler than for the existing synchronous IO case, since librt takes care of the waiting.
>>>>>> Specifically, there is no need for extra wait-for-io control blocks such as in bufmgr's WaitIO().
>>>>>
>>>>> [checks]. No, it doesn't work. See attached test program.
>
> Thanks for checking and thanks for coming up with that test program.
> However, yes, it really does work -- always (on linux).
> Your test program is doing things in the wrong order -
> it calls aio_suspend *before* aio_error.
> The rule is: call aio_suspend *after* aio_error,
> and *only* if aio_error returns EINPROGRESS.

I see no such rule in the man pages of any of the functions involved. And it wouldn't matter anyway; the behavior is exactly the same if you aio_error() first.

> See the code changes to the fd.c function FileCompleteaio()
> to see how we have done it. And I am attaching a corrected version
> of your test program, which runs just fine.

As Claudio mentioned earlier, the way FileCompleteaio() uses aio_suspend is just a complicated way of polling. You might as well replace the aio_suspend() calls with pg_usleep().

>>>>> It kinda seems to work sometimes, because of the way it's implemented in glibc. The aiocb struct has a field for the result value and errno, and when the I/O is finished, the worker thread fills them in. aio_error() and aio_return() just return the values of those fields, so calling aio_error() or aio_return() does in fact happen to work from a different process. aio_suspend(), however, is implemented by sleeping on a process-local mutex, which does not work from a different process.
>>>>>
>>>>> Even if it worked on Linux today, it would be a bad idea to rely on it from a portability point of view. No, the only sane way to make this work is that the process that initiates an I/O request is responsible for completing it. If another process needs to wait for an async I/O to complete, we must use some other means to do the waiting. Like the io_in_progress_lock that we already have, for the same purpose.
>>>>
>>>> But calls to it are timeouted by 10us, effectively turning the thing into polling mode.
>>>
>>> We don't want polling... And even if we did, calling aio_suspend() in a way that's known to be broken, in a loop, is a pretty crappy way of polling.
>
> Well, as mentioned earlier, it is not broken. Whether it is efficient, I am not sure.
> I have looked at the mutex in aio_suspend that you mentioned, and I am not
> quite convinced that, if the caller is not the original aio_read process,
> it renders the suspend() into an instant timeout. I will see if I can verify that.

I don't see the point of pursuing this design further. Surely we don't want to use polling here, and you're relying on undefined behavior anyway. I'm pretty sure aio_return/aio_error won't work from a different process on all platforms, even if it happens to work on Linux. Even on Linux, it might stop working if the underlying implementation changes from the glibc pthread emulation to something kernel-based.

- Heikki
On Fri, May 30, 2014 at 4:15 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>>> We don't want polling... And even if we did, calling aio_suspend() in a way
>>>> that's known to be broken, in a loop, is a pretty crappy way of polling.
>>
>> Well, as mentioned earlier, it is not broken. Whether it is efficient, I am not sure.
>> I have looked at the mutex in aio_suspend that you mentioned, and I am not
>> quite convinced that, if the caller is not the original aio_read process,
>> it renders the suspend() into an instant timeout. I will see if I can verify that.
>
> I don't see the point of pursuing this design further. Surely we don't want
> to use polling here, and you're relying on undefined behavior anyway. I'm
> pretty sure aio_return/aio_error won't work from a different process on all
> platforms, even if it happens to work on Linux. Even on Linux, it might stop
> working if the underlying implementation changes from the glibc pthread
> emulation to something kernel-based.

I'll try to do some measuring of performance with:

a) git head
b) git head + patch as-is
c) git head + patch without aio_suspend in foreign processes (just re-read)
d) git head + patch with a lwlock (or whatever works) instead of aio_suspend

a-c will be the fastest, d might take some while.

I'll let you know of the results as I get them.
On 05/30/14 09:36, Claudio Freire wrote:

On Fri, May 30, 2014 at 4:15 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> I see no such rule in the man pages of any of the functions involved. And it wouldn't matter anyway; the behavior is exactly the same if you aio_error() first.

You are completely correct and I was wrong.
For the different-process case,
it is not the order of the aio calls that makes it work/not work,
it is whether it polls with a timeout.
I think I was confusing this with a rule concerning the order for
calling aio_return.
> We don't want polling... And even if we did, calling aio_suspend() in a way that's known to be broken, in a loop, is a pretty crappy way of polling.
I have made a change to the complete_aio functions
to use polling *only* if the caller is not the originator of the aio_read.
If it is the originator, it now calls aio_suspend with no timeout
(i.e. wait till complete).
The loop also now checks for the EINTR case which someone
pointed out.
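In outline the new logic is roughly as follows (a simplified sketch, not the exact code in the patch; originator_pid is assumed to be recorded alongside the aiocb in shared memory):

    #include <aio.h>
    #include <errno.h>
    #include <stdbool.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    static bool
    complete_aio_read(struct aiocb *cb, pid_t originator_pid)
    {
        const struct aiocb *list[1] = {cb};

        if (getpid() == originator_pid)
        {
            /* Originator: block until the request completes (no timeout). */
            while (aio_error(cb) == EINPROGRESS)
            {
                if (aio_suspend(list, 1, NULL) != 0 && errno != EINTR)
                    return false;   /* unexpected failure; caller re-reads */
            }
            return true;
        }
        else
        {
            /* Not the originator: poll with a short timeout, as before. */
            struct timespec timeout = {0, 10000};   /* 10 us */
            int         tries;

            for (tries = 0; tries < 256; tries++)
            {
                if (aio_error(cb) != EINPROGRESS)
                    return true;
                (void) aio_suspend(list, 1, &timeout);
            }
            return false;           /* give up; caller does a plain read */
        }
    }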
In my test runs, with a debug message in FileCompleteaio to tell me
whether caller is or is not the originator of the aio_read,
I see > 99.8% of calls are from originator and only < 0.2% are not.
e.g. (samples from two backends)
different : 10
same : 11726
different : 38
same : 12105
A new patch, based on today's (140531) source, is attached.
This improves one of my benchmarks by about 10% throughput,
and now shows an overall 23% improvement relative to existing code with posix_fadvise.
So hopefully this addresses your performance concern.
If you look at the new patch, you'll see that for the different-pid case,
I still call aio_suspend with a timeout.
As you or Claudio pointed out earlier, it could just as well sleep for the same timeout,
but the small advantage of calling aio_suspend is that it catches the case where the io completed
just between aio_error returning EINPROGRESS and the aio_suspend call.
Also it makes the code simpler. In fact this change is quite small, just a few lines
in backend/storage/buffer/buf_async.c and backend/storage/file/fd.c.
Based on this, I think it is not necessary to get rid of the polling altogether
(and in any case, as far as I can see, very difficult).
>> Well, as mentioned earlier, it is not broken. Whether it is efficient, I am not sure.
>> I have looked at the mutex in aio_suspend that you mentioned, and I am not
>> quite convinced that, if the caller is not the original aio_read process,
>> it renders the suspend() into an instant timeout. I will see if I can verify that.
>
> I don't see the point of pursuing this design further. Surely we don't want to use polling here, and you're relying on undefined behavior anyway. I'm pretty sure aio_return/aio_error won't work from a different process on all platforms, even if it happens to work on Linux. Even on Linux, it might stop working if the underlying implementation changes from the glibc pthread emulation to something kernel-based.
Good point. I have included the guts of your little test program
(modified to do polling) into the existing autoconf test program
that decides on the
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
See config/c-library.m4.
I hope this goes some way to answer your concern about robustness,
as at least now if the implementation changes in some way that
renders the polling ineffective, it will be caught in configure.
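The guts of that probe are roughly as follows (a sketch of the idea only, written for this message -- the authoritative version is the test program in config/c-library.m4). It exits 0 only if a child process can observe, by polling aio_error() on an aiocb in shared memory, the completion of an aio_read() issued by the parent:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct aiocb *cb = mmap(NULL, sizeof(struct aiocb) + 512,
                                PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        int         fd = open("/etc/hosts", O_RDONLY);
        int         status;

        if (cb == MAP_FAILED || fd < 0)
            return 1;

        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf = (char *) (cb + 1);
        cb->aio_nbytes = 512;

        if (aio_read(cb) != 0)
            return 1;

        if (fork() == 0)
        {
            int         i;

            /* child polls the shared aiocb for up to ~1 second */
            for (i = 0; i < 1000; i++)
            {
                if (aio_error(cb) != EINPROGRESS)
                    _exit(aio_return(cb) > 0 ? 0 : 1);
                usleep(1000);
            }
            _exit(1);               /* never saw completion: feature unusable */
        }

        wait(&status);
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
    }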
> I'll try to do some measuring of performance with:
> a) git head
> b) git head + patch as-is
> c) git head + patch without aio_suspend in foreign processes (just re-read)
> d) git head + patch with a lwlock (or whatever works) instead of aio_suspend
>
> a-c will be the fastest, d might take some while.
>
> I'll let you know of the results as I get them.

Claudio, I am not quite sure if what I am submitting now is
quite the same as any of yours. As I promised before, but have
not yet done, I will package one or two of my benchmarks and
send them in.
Attachment
<div class="moz-cite-prefix">On 05/31/14 20:44, johnlumby wrote:<br /></div><blockquote cite="mid:538A776C.2060006@hotmail.com"type="cite"><div class="moz-cite-prefix">On 05/30/14 09:36, Claudio Freire wrote:<br/></div><br /><font face="Nimbus Roman No9 L">Good point. I have included the guts of your little test program<br/> (modified to do polling) into the existing autoconf test program<br /> that decides on the <br /> #define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.<br/> See config/c-library.m4.<br /> I hope this goes some way to answer your concern aboutrobustness,<br /> as at least now if the implementation changes in some way that<br /> renders the polling ineffective, it will be caught in configure.<br /><br /></font><br /></blockquote><font face="Nimbus Roman No9 L">I meantto add that by including this test, which involves a fork(),<br /> in the autoconf tester, on Windows<br /> USE_AIO_ATOMIC_BUILTIN_COMP_SWAPwould always by un-defined.<br /> (But could then be defined manually if someone wanted togive it a try)<br /></font><br />
On Sat, May 31, 2014 at 9:44 PM, johnlumby <johnlumby@hotmail.com> wrote:
>> I'll try to do some measuring of performance with:
>> a) git head
>> b) git head + patch as-is
>> c) git head + patch without aio_suspend in foreign processes (just re-read)
>> d) git head + patch with a lwlock (or whatever works) instead of aio_suspend
>>
>> a-c will be the fastest, d might take some while.
>>
>> I'll let you know of the results as I get them.
>
> Claudio, I am not quite sure if what I am submitting now is
> quite the same as any of yours. As I promised before, but have
> not yet done, I will package one or two of my benchmarks and
> send them in.

It's a tad different.

c will not do polling on the foreign process, I will just let PG do the read again.

d will be like polling, but without the associated CPU overhead.
On 06/01/2014 03:44 AM, johnlumby wrote:
> If you look at the new patch, you'll see that for the different-pid case,
> I still call aio_suspend with a timeout.
> As you or Claudio pointed out earlier, it could just as well sleep for the same timeout,
> but the small advantage of calling aio_suspend is that it catches the case where the io completed
> just between aio_error returning EINPROGRESS and the aio_suspend call.
> Also it makes the code simpler. In fact this change is quite small, just a few lines
> in backend/storage/buffer/buf_async.c and backend/storage/file/fd.c.
>
> Based on this, I think it is not necessary to get rid of the polling altogether
> (and in any case, as far as I can see, very difficult).

That's still just as wrong as it always has been. Just get rid of it. Don't put aiocb structs in shared memory at all. They don't belong there.

>>>> Well, as mentioned earlier, it is not broken. Whether it is efficient, I am not sure.
>>>> I have looked at the mutex in aio_suspend that you mentioned, and I am not
>>>> quite convinced that, if the caller is not the original aio_read process,
>>>> it renders the suspend() into an instant timeout. I will see if I can verify that.
>>>
>>> I don't see the point of pursuing this design further. Surely we don't want
>>> to use polling here, and you're relying on undefined behavior anyway. I'm
>>> pretty sure aio_return/aio_error won't work from a different process on all
>>> platforms, even if it happens to work on Linux. Even on Linux, it might stop
>>> working if the underlying implementation changes from the glibc pthread
>>> emulation to something kernel-based.
>
> Good point. I have included the guts of your little test program
> (modified to do polling) into the existing autoconf test program
> that decides on the
> #define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
> See config/c-library.m4.
> I hope this goes some way to answer your concern about robustness,
> as at least now if the implementation changes in some way that
> renders the polling ineffective, it will be caught in configure.

No, that does not make it robust enough.

- Heikki
Updated version of the patch, compatible with git head of 140608 (adjusted proc oid and a couple of minor fixes).
Attachment
I'm having trouble setting max_async_io_prefetchers in postgresql.conf.
It says it cannot be changed.

Is that fixed at initdb time? (Sounds like a bad idea if it is.)

On Sun, Jun 8, 2014 at 11:12 PM, johnlumby <johnlumby@hotmail.com> wrote:
> Updated version of the patch, compatible with git head of 140608
> (adjusted proc oid and a couple of minor fixes).
On Wed, May 28, 2014 at 2:19 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > How portable is POSIX aio nowadays? Googling around, it still seems that on > Linux, it's implemented using threads. Does the thread-emulation > implementation cause problems with the rest of the backend, which assumes > that there is only a single thread? In any case, I think we'll want to > encapsulate the AIO implementation behind some kind of an API, to allow > other implementations to co-exist. I think POSIX aio is pretty damn standard and it's a pretty fiddly interface. If we abstract it behind an i/o interface we're going to lose a lot of the power. Abstracting it behind a set of buffer manager operations (initiate i/o on buffer, complete i/o on buffer, abort i/o on buffer) should be fine but that's basically what we have, no? I don't think the threaded implementation on Linux is the one to use though. I find this *super* confusing but the kernel definitely supports aio syscalls, glibc also has a threaded implementation it uses if run on a kernel that doesn't implement the syscalls, and I think there are existing libaio and librt libraries from outside glibc that do one or the other. Which you build against seems to make a big difference. My instinct is that anything but the kernel native implementation will be worthless. The overhead of thread communication will completely outweigh any advantage over posix_fadvise's partial win. The main advantage of posix aio is that we can actually receive the data out of order. With posix_fadvise we can get the i/o and cpu overlap but we will never process the later blocks until the earlier requests are satisfied and processed in order. With aio you could do a sequential scan, initiating i/o on 1,000 blocks and then processing them as they arrive, initiating new requests as those blocks are handled. When I investigated this I found the buffer manager's I/O bits seemed to already be able to represent the state we needed (i/o initiated on this buffer but not completed). The problem was in ensuring that a backend would process the i/o completion promptly when it might be in the midst of handling other tasks and might even get an elog() stack unwinding. The interface that actually fits Postgres best might be the threaded interface (orthogonal to the threaded implementation question) which is you give aio a callback which gets called on a separate thread when the i/o completes. The alternative is you give aio a list of operation control blocks and it tells you the state of all the i/o operations. But it's not clear to me how you arrange to do that regularly, promptly, and reliably. The other gotcha here is that the kernel implementation only does anything useful on DIRECT_IO files. That means you have to do *all* the prefetching and i/o scheduling yourself. You would be doing that anyways for sequential scans and bitmap scans -- and we already do it with things like synchronised scans and posix_fadvise -- but index scans would need to get some intelligence for when it makes sense to read more than one page at a time. It might be possible to do something fairly coarse like having our i/o operators keep track of how often i/o on a relation falls within a certain number of blocks of an earlier i/o and autotune number of blocks to read based on that. It might not be hard to do better than the kernel with even basic info like what level of the index we're reading or what type of pointer we're following. 
Finally, when I did the posix_fadvise work I wrote a synthetic benchmark for testing the equivalent i/o pattern of a bitmap scan. It let me simulate bitmap scans of varying densities with varying parameters, notably how many i/o to keep in flight at once. It supported posix_fadvise or aio. You should look it up in the archives, it made for some nice looking graphs. IIRC I could not find any build environment where aio offered any performance boost at all. I think this means I just didn't know how to build it against the right libraries or wasn't using the right kernel or there was some skew between them at the time. -- greg
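For readers unfamiliar with the interface under discussion, the out-of-order completion pattern Greg describes maps onto the plain <aio.h> calls roughly as in the sketch below. This is a stand-alone illustration, not code from the patch; the file name and request count are made up, and on glibc it would be linked with -lrt.

/* Stand-alone sketch of out-of-order completion with POSIX aio; not patch code. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NREQ   4
#define BLCKSZ 8192

int
main(void)
{
    int         fd = open("testfile", O_RDONLY);
    struct aiocb cb[NREQ];
    const struct aiocb *wait_list[NREQ];
    static char buf[NREQ][BLCKSZ];
    int         remaining = NREQ;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* initiate all of the reads up front */
    for (int i = 0; i < NREQ; i++)
    {
        memset(&cb[i], 0, sizeof(cb[i]));
        cb[i].aio_fildes = fd;
        cb[i].aio_buf = buf[i];
        cb[i].aio_nbytes = BLCKSZ;
        cb[i].aio_offset = (off_t) i * BLCKSZ;
        wait_list[i] = &cb[i];
        if (aio_read(&cb[i]) != 0)
        {
            perror("aio_read");
            return 1;
        }
    }

    /* consume completions in whatever order they arrive */
    while (remaining > 0)
    {
        (void) aio_suspend(wait_list, NREQ, NULL);      /* sleep until one finishes */
        for (int i = 0; i < NREQ; i++)
        {
            if (wait_list[i] == NULL || aio_error(&cb[i]) == EINPROGRESS)
                continue;
            printf("block %d finished: %zd bytes\n", i, aio_return(&cb[i]));
            wait_list[i] = NULL;        /* NULL entries are ignored by aio_suspend */
            remaining--;
        }
    }
    close(fd);
    return 0;
}

The second half of the loop is the point Greg is making: whichever request finishes first can be processed first, whereas a posix_fadvise-based scheme still consumes blocks strictly in request order.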
On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark@mit.edu> wrote: > I don't think the threaded implementation on Linux is the one to use > though. I find this *super* confusing but the kernel definitely > supports aio syscalls, glibc also has a threaded implementation it > uses if run on a kernel that doesn't implement the syscalls, and I > think there are existing libaio and librt libraries from outside glibc > that do one or the other. Which you build against seems to make a big > difference. My instinct is that anything but the kernel native > implementation will be worthless. The overhead of thread communication > will completely outweigh any advantage over posix_fadvise's partial > win. What overhead? The only communication is through a "done" bit and associated synchronization structure when *and only when* you want to wait on it. Furthermore, posix_fadvise is braindead on this use case, been there, done that. What you win with threads is a better postgres-kernel interaction, even if you loose some CPU performance it's gonna beat posix_fadvise by a large margin. > The main advantage of posix aio is that we can actually receive the > data out of order. With posix_fadvise we can get the i/o and cpu > overlap but we will never process the later blocks until the earlier > requests are satisfied and processed in order. With aio you could do a > sequential scan, initiating i/o on 1,000 blocks and then processing > them as they arrive, initiating new requests as those blocks are > handled. And each and every I/O you issue with it counts as such on the kernel side. It's not the case with posix_fadvise, mind you, and that's terribly damaging for performance. > When I investigated this I found the buffer manager's I/O bits seemed > to already be able to represent the state we needed (i/o initiated on > this buffer but not completed). The problem was in ensuring that a > backend would process the i/o completion promptly when it might be in > the midst of handling other tasks and might even get an elog() stack > unwinding. The interface that actually fits Postgres best might be the > threaded interface (orthogonal to the threaded implementation > question) which is you give aio a callback which gets called on a > separate thread when the i/o completes. The alternative is you give > aio a list of operation control blocks and it tells you the state of > all the i/o operations. But it's not clear to me how you arrange to do > that regularly, promptly, and reliably. Indeed we've been musing about using librt's support of completion callbacks for that. > The other gotcha here is that the kernel implementation only does > anything useful on DIRECT_IO files. That means you have to do *all* > the prefetching and i/o scheduling yourself. If that's the case, we should discard kernel-based implementations and stick to thread-based ones. Postgres cannot do scheduling as efficiently as the kernel, and it shouldn't try. > You would be doing that > anyways for sequential scans and bitmap scans -- and we already do it > with things like synchronised scans and posix_fadvise That only works because the patterns are semi-sequential. If you try to schedule random access, it becomes effectively impossible without hardware info. The kernel is the one with hardware info. > Finally, when I did the posix_fadvise work I wrote a synthetic > benchmark for testing the equivalent i/o pattern of a bitmap scan. 
> It let me simulate bitmap scans of varying densities with varying > parameters, notably how many i/o to keep in flight at once. It > supported posix_fadvise or aio. You should look it up in the archives, > it made for some nice looking graphs. IIRC I could not find any build > environment where aio offered any performance boost at all. I think > this means I just didn't know how to build it against the right > libraries or wasn't using the right kernel or there was some skew > between them at the time. If it's old, it's probable you didn't hit the kernel's braindeadness, since it was introduced somewhat recently (i.e., before 3.x I believe). Even if you did hit it, bitmap heap scans are blessed with sequential ordering. The technique doesn't work nearly as well with random I/O that might be sorted from time to time. When traversing an index, you do a mostly sequential pattern due to physical correlation, but not completely sequential. Not only that, with the assumption of random I/O, and the uncertainty of when the scan will be aborted too, you don't read ahead as much as you could if you knew it was sequential or a full scan. That kills performance. You don't fetch enough ahead of time to avoid stalls, and the kernel doesn't do read-ahead either because posix_fadvise effectively disables it, resulting in the equivalent of direct I/O with bad scheduling. Solving this for index scans isn't just a little more complex. It's insanely more complex, because you need hardware information to do it right. How many spindles, how many sectors per cylinder if it's rotational, how big the segments if it's flash, etc, etc... all stuff hidden away inside the kernel. It's not a good idea to try to do the kernel's job. Aio, even threaded, lets you avoid that. If you still have the benchmark around, I suggest you shuffle the sectors a little bit (but not fully) and try them with semi-random I/O.
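For contrast with the aio sketch above, the posix_fadvise-style prefetch that the existing effective_io_concurrency machinery relies on is a fire-and-forget hint followed by an ordinary blocking read, so blocks are still consumed strictly in request order. A rough sketch with invented helper names, not actual PostgreSQL code:

#define _XOPEN_SOURCE 600
/* Sketch of advise-then-read prefetching (illustrative only). */
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Hint the kernel that these block numbers will be needed soon (fire-and-forget). */
static void
prefetch_blocks(int fd, const off_t *blocks, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
        (void) posix_fadvise(fd, blocks[i] * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
}

/*
 * The data still arrives only through a later synchronous read: the caller
 * blocks here if readahead has not finished, and blocks are consumed strictly
 * in the order the caller asks for them - there is no out-of-order completion.
 */
static ssize_t
read_block(int fd, off_t blockno, char *buf)
{
    return pread(fd, buf, BLCKSZ, blockno * BLCKSZ);
}

Whether the hint also suppresses the kernel's own readahead for the hinted range, as Claudio reports, presumably depends on the kernel version in use.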
On Mon, Jun 9, 2014 at 11:12 AM, johnlumby <johnlumby@hotmail.com> wrote: > updated version of patch compatible with git head of 140608, > (adjusted proc oid and a couple of minor fixes) Compilation of patched version on MacOS failed. The error messages were gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -I../../../../src/include -c -o heapam.o heapam.c heapam.c: In function 'heap_unread_add': heapam.c:362: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_next' heapam.c:363: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_count' heapam.c:369: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_count' heapam.c:375: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_base' heapam.c:381: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_base' heapam.c:387: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_next' heapam.c:405: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_count' heapam.c: In function 'heap_unread_subtract': heapam.c:419: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_next' heapam.c:425: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_count' heapam.c:434: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_base' heapam.c:442: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_base' heapam.c:452: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_next' heapam.c:453: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_next' heapam.c:454: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_next' heapam.c:456: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_count' heapam.c: In function 'heapgettup': heapam.c:944: error: 'struct HeapScanDescData' has no member named 'rs_pfchblock' heapam.c:716: warning: unused variable 'ix' heapam.c: In function 'heapgettup_pagemode': heapam.c:1243: error: 'struct HeapScanDescData' has no member named 'rs_pfchblock' heapam.c:1029: warning: unused variable 'ix' heapam.c: In function 'heap_endscan': heapam.c:1808: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_base' heapam.c:1809: error: 'struct HeapScanDescData' has no member named 'rs_Unread_Pfetched_base' make[4]: *** [heapam.o] Error 1 make[3]: *** [heap-recursive] Error 2 make[2]: *** [access-recursive] Error 2 make[1]: *** [install-backend-recurse] Error 2 make: *** [install-src-recurse] Error 2 Huge patch is basically not easy to review. What about simplifying the patch by excluding non-core parts like the change of pg_stat_statements, so that reviewers can easily read the patch? We can add such non-core parts later. Regards, -- Fujii Masao
Thanks Fujii , that is a bug -- an #ifdef USE_PREFETCH is missing in heapam.c (maybe several) I will fix it in the next patch version. I also appreciate it is not easy to review the patch. There are really 4 (or maybe 5) parts : . async io (librt functions) . buffer management (allocating, locking and pinning etc) . scanners making prefetch calls . statistics and the autoconf input program I will see what I can do. Maybe putting an indicator against each modified file? I am currently working on two things : . alternative way for non-originator of an aio_read to wait on completion (LWlock instead of polling the aiocb) This was talked about in several earlier posts and Claudio is also working on something there . package up my benchmark Cheers John ---------------------------------------- > Date: Fri, 20 Jun 2014 04:21:19 +0900 > Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch > From: masao.fujii@gmail.com > To: johnlumby@hotmail.com > CC: pgsql-hackers@postgresql.org; klaussfreire@gmail.com > > On Mon, Jun 9, 2014 at 11:12 AM, johnlumby <johnlumby@hotmail.com> wrote: >> updated version of patch compatible with git head of 140608, >> (adjusted proc oid and a couple of minor fixes) > > Compilation of patched version on MacOS failed. The error messages were > > gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith > -Wdeclaration-after-statement -Wendif-labels > -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing > -fwrapv -g -I../../../../src/include -c -o heapam.o heapam.c > heapam.c: In function 'heap_unread_add': > heapam.c:362: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_next' > heapam.c:363: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_count' > heapam.c:369: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_count' > heapam.c:375: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_base' > heapam.c:381: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_base' > heapam.c:387: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_next' > heapam.c:405: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_count' > heapam.c: In function 'heap_unread_subtract': > heapam.c:419: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_next' > heapam.c:425: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_count' > heapam.c:434: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_base' > heapam.c:442: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_base' > heapam.c:452: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_next' > heapam.c:453: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_next' > heapam.c:454: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_next' > heapam.c:456: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_count' > heapam.c: In function 'heapgettup': > heapam.c:944: error: 'struct HeapScanDescData' has no member named > 'rs_pfchblock' > heapam.c:716: warning: unused variable 'ix' > heapam.c: In function 'heapgettup_pagemode': > heapam.c:1243: error: 'struct HeapScanDescData' has no member named > 'rs_pfchblock' > heapam.c:1029: warning: unused variable 'ix' > heapam.c: In function 'heap_endscan': > heapam.c:1808: error: 'struct HeapScanDescData' has no member 
named > 'rs_Unread_Pfetched_base' > heapam.c:1809: error: 'struct HeapScanDescData' has no member named > 'rs_Unread_Pfetched_base' > make[4]: *** [heapam.o] Error 1 > make[3]: *** [heap-recursive] Error 2 > make[2]: *** [access-recursive] Error 2 > make[1]: *** [install-backend-recurse] Error 2 > make: *** [install-src-recurse] Error 2 > > > Huge patch is basically not easy to review. What about simplifying the patch > by excluding non-core parts like the change of pg_stat_statements, so that > reviewers can easily read the patch? We can add such non-core parts later. > > Regards, > > -- > Fujii Masao
On 6/20/14, 5:12 PM, John Lumby wrote: > I also appreciate it is not easy to review the patch. > There are really 4 (or maybe 5) parts : > > . async io (librt functions) > . buffer management (allocating, locking and pinning etc) > . scanners making prefetch calls > . statistics > > and the autoconf input program > > I will see what I can do. Maybe putting an indicator against each modified file? Generally the best way to handle cases like this is to create multiple patches, each of which builds upon previous ones. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
---------------------------------------- > Date: Thu, 19 Jun 2014 15:43:44 -0300 > Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch > From: klaussfreire@gmail.com > To: stark@mit.edu > CC: hlinnakangas@vmware.com; johnlumby@hotmail.com; pgsql-hackers@postgresql.org > > On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark@mit.edu> wrote: >> I don't think the threaded implementation on Linux is the one to use >> though. [... ] The overhead of thread communication >> will completely outweigh any advantage over posix_fadvise's partial >> win. > > What overhead? > > The only communication is through a "done" bit and associated > synchronization structure when *and only when* you want to wait on it. > Threads do cost some extra CPU, but provided the system has some spare CPU capacity, performance improves because of reduced IO wait. I quoted a measured improvement of 17% - 18% in the README along with some more explanation of when the async IO gives an improvement. > Furthermore, posix_fadvise is braindead on this use case, been there, > done that. What you win with threads is a better postgres-kernel > interaction, even if you loose some CPU performance it's gonna beat > posix_fadvise by a large margin. > > [...] > >> When I investigated this I found the buffer manager's I/O bits seemed >> to already be able to represent the state we needed (i/o initiated on >> this buffer but not completed). The problem was in ensuring that a >> backend would process the i/o completion promptly when it might be in >> the midst of handling other tasks and might even get an elog() stack >> unwinding. The interface that actually fits Postgres best might be the >> threaded interface (orthogonal to the threaded implementation >> question) which is you give aio a callback which gets called on a >> separate thread when the i/o completes. The alternative is you give >> aio a list of operation control blocks and it tells you the state of >> all the i/o operations. But it's not clear to me how you arrange to do >> that regularly, promptly, and reliably. > > Indeed we've been musing about using librt's support of completion > callbacks for that. For the most common case, where a backend issues a PrefetchBuffer and then that *same* backend issues ReadBuffer, the posix aio works ideally, since there is no need for any callback or completion signal; we simply check "is it complete" during the ReadBuffer. It is when some *other* backend gets there first with the ReadBuffer that things are a bit trickier. The current version of the patch did polling for that case but that drew criticism, and so an imminent new version of the patch uses the sigevent mechanism. And there are other ways still. In an earlier posting I reported that, in my benchmark, 99.8% of [FileCompleteaio] calls are from the originator and only < 0.2% are not. So, from a performance perspective, only the common case really matters.
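To make the common case concrete - the prefetching backend later consuming its own aio - the completion check John describes needs nothing beyond the standard aiocb calls. A rough sketch (the helper name is invented; this is not the patch's FileCompleteaio):

#include <aio.h>
#include <errno.h>

/*
 * Sketch only: how an originator might check, and if necessary wait for,
 * completion of an aio_read it issued itself.  Returns bytes read or -1.
 */
static ssize_t
complete_own_aio(struct aiocb *acb)
{
    const struct aiocb *list[1] = {acb};
    int         err;

    /* poll once; if still in flight, sleep until the request finishes */
    while ((err = aio_error(acb)) == EINPROGRESS)
        (void) aio_suspend(list, 1, NULL);

    if (err != 0)
    {
        errno = err;
        return -1;              /* the I/O itself failed */
    }
    return aio_return(acb);     /* reap the result exactly once */
}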
On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote: > It is when some *other* backend gets there first with the ReadBuffer that > things are a bit trickier. The current version of the patch did polling for that case > but that drew criticism, and so an imminent new version of the patch > uses the sigevent mechanism. And there are other ways still. I'm a bit puzzled by this though. Postgres *already* has code for this case. When you call ReadBuffer you set the bits on the buffer indicating I/O is in progress. If another backend does ReadBuffer for the same block they'll get the same buffer and then wait until the first backend's I/O completes. ReadBuffer goes through some hoops to handle this (and all the corner cases such as the other backend's I/O completing and the buffer being reused for another block before the first backend reawakens). It would be a shame to reinvent the wheel. The problem with using the Buffers I/O in progress bit is that the I/O might complete while the other backend is busy doing stuff. As long as you can handle the I/O completion promptly -- either in callback or thread or signal handler then that wouldn't matter. But I'm not clear that any of those will work reliably. -- greg
---------------------------------------- > From: stark@mit.edu > Date: Mon, 23 Jun 2014 16:04:50 -0700 > Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch > To: johnlumby@hotmail.com > CC: klaussfreire@gmail.com; hlinnakangas@vmware.com; pgsql-hackers@postgresql.org > > On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote: >> It is when some *other* backend gets there first with the ReadBuffer that >> things are a bit trickier. The current version of the patch did polling for that case >> but that drew criticism, and so an imminent new version of the patch >> uses the sigevent mechanism. And there are other ways still. > > I'm a bit puzzled by this though. Postgres *already* has code for this > case. When you call ReadBuffer you set the bits on the buffer Good question. Let me explain. Yes, postgresql has code for the case of a backend is inside a synchronous read() or write(), performed from a ReadBuffer(), and some other backend wants that buffer. asynchronous aio is initiated not from ReadBuffer but from PrefetchBuffer, and performs its aio_read into an allocated, pinned, postgresql buffer. This is entirely different from the synchronous io case. Why? Because the issuer of the aio_read (the "originator") is unaware of this buffer pinned on its behalf, and is then free to do any other reading or writing it wishes, such as more prefetching or any other operation. And furthermore, it may *never* issue a ReadBuffer for the block which it prefetched. Therefore, asynchronous IO is different from synchronous IO, and a new bit, BM_AIO_IN_PROGRESS, in the buf_header is required to track this aio operation until completion. I would encourage you to read the new postgresql-prefetching-asyncio.README in the patch file where this is explained in greater detail. > indicating I/O is in progress. If another backend does ReadBuffer for > the same block they'll get the same buffer and then wait until the > first backend's I/O completes. ReadBuffer goes through some hoops to > handle this (and all the corner cases such as the other backend's I/O > completing and the buffer being reused for another block before the > first backend reawakens). It would be a shame to reinvent the wheel. No re-invention! Actually some effort has been made to use the existing functions in bufmgr.c as much as possible rather than rewriting them. > > The problem with using the Buffers I/O in progress bit is that the I/O > might complete while the other backend is busy doing stuff. As long as > you can handle the I/O completion promptly -- either in callback or > thread or signal handler then that wouldn't matter. But I'm not clear > that any of those will work reliably. They both work reliably, but the criticism was that backend B polling an aiocb of an aio issued by backend A is not documented as being supported (although it happens to work), hence the proposed change to use sigevent. By the way, on the "will it actually work though?" question which several folks have raised, I should mention that this patch has been in semi-production use for almost 2 years now in different stages of completion on all postgresql releases from 9.1.4 to 9.5 devel. I would guess it has had around 500 hours of operation by now. I'm sure there are bugs still to be found but I am confident it is fundamentally sound. > > -- > greg
On 06/24/2014 04:29 PM, John Lumby wrote: >> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote: >>> It is when some *other* backend gets there first with the ReadBuffer that >>> things are a bit trickier. The current version of the patch did polling for that case >>> but that drew criticism, and so an imminent new version of the patch >>> uses the sigevent mechanism. And there are other ways still. >> >> I'm a bit puzzled by this though. Postgres *already* has code for this >> case. When you call ReadBuffer you set the bits on the buffer > > Good question. Let me explain. > Yes, postgresql has code for the case of a backend is inside a synchronous > read() or write(), performed from a ReadBuffer(), and some other backend > wants that buffer. asynchronous aio is initiated not from ReadBuffer > but from PrefetchBuffer, and performs its aio_read into an allocated, pinned, > postgresql buffer. This is entirely different from the synchronous io case. > Why? Because the issuer of the aio_read (the "originator") is unaware > of this buffer pinned on its behalf, and is then free to do any other > reading or writing it wishes, such as more prefetching or any other operation. > And furthermore, it may *never* issue a ReadBuffer for the block which it > prefetched. I still don't see the difference. Once an asynchronous read is initiated on the buffer, it can't be used for anything else until the read has finished. This is exactly the same situation as with a synchronous read: after read() is called, the buffer can't be used for anything else until the call finishes. In particular, consider the situation from another backend's point of view. Looking from another backend (i.e. one that didn't initiate the read), there's no difference between a synchronous and asynchronous read. So why do we need a different IPC mechanism for the synchronous and asynchronous cases? We don't. I understand that *within the backend*, you need to somehow track the I/O, and you'll need to treat synchronous and asynchronous I/Os differently. But that's all within the same backend, and doesn't need to involve the flags or locks in shared memory at all. The inter-process communication doesn't need any changes. >> The problem with using the Buffers I/O in progress bit is that the I/O >> might complete while the other backend is busy doing stuff. As long as >> you can handle the I/O completion promptly -- either in callback or >> thread or signal handler then that wouldn't matter. But I'm not clear >> that any of those will work reliably. > > They both work reliably, but the criticism was that backend B polling > an aiocb of an aio issued by backend A is not documented as > being supported (although it happens to work), hence the proposed > change to use sigevent. You didn't understand what Greg meant. You need to handle the completion of the I/O in the same process that initiated it, by clearing the in-progress bit of the buffer and releasing the I/O in-progress lwlock on it. And you need to do that very quickly after the I/O has finished, because there might be another backend waiting for the buffer and you don't want him to wait longer than necessary. The question is, if you receive the notification of the I/O completion using a signal or a thread, is it safe to release the lwlock from the signal handler or a separate thread? > By the way, on the "will it actually work though?" 
question which several folks > have raised, I should mention that this patch has been in semi-production > use for almost 2 years now in different stages of completion on all postgresql > releases from 9.1.4 to 9.5 devel. I would guess it has had around > 500 hours of operation by now. I'm sure there are bugs still to be > found but I am confident it is fundamentally sound. Well, a committable version of this patch is going to look quite different from the first version that you posted, so I don't put much weight on how long you've tested the first version. - Heikki
Thanks Heikki, ---------------------------------------- > Date: Tue, 24 Jun 2014 17:02:38 +0300 > From: hlinnakangas@vmware.com > To: johnlumby@hotmail.com; stark@mit.edu > CC: klaussfreire@gmail.com; pgsql-hackers@postgresql.org > Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch > > On 06/24/2014 04:29 PM, John Lumby wrote: >>> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote: >>>> It is when some *other* backend gets there first with the ReadBuffer that >>>> things are a bit trickier. The current version of the patch did polling for that case >>>> but that drew criticism, and so an imminent new version of the patch >>>> uses the sigevent mechanism. And there are other ways still. >>> >>> I'm a bit puzzled by this though. Postgres *already* has code for this >>> case. When you call ReadBuffer you set the bits on the buffer >> >> Good question. Let me explain. >> Yes, postgresql has code for the case of a backend is inside a synchronous >> read() or write(), performed from a ReadBuffer(), and some other backend >> wants that buffer. asynchronous aio is initiated not from ReadBuffer >> but from PrefetchBuffer, and performs its aio_read into an allocated, pinned, >> postgresql buffer. This is entirely different from the synchronous io case. >> Why? Because the issuer of the aio_read (the "originator") is unaware >> of this buffer pinned on its behalf, and is then free to do any other >> reading or writing it wishes, such as more prefetching or any other operation. >> And furthermore, it may *never* issue a ReadBuffer for the block which it >> prefetched. > > I still don't see the difference. Once an asynchronous read is initiated > on the buffer, it can't be used for anything else until the read has > finished. This is exactly the same situation as with a synchronous read: > after read() is called, the buffer can't be used for anything else until > the call finishes. Ah, now I see what you and Greg are asking. See my next imbed below. > > In particular, consider the situation from another backend's point of > view. Looking from another backend (i.e. one that didn't initiate the > read), there's no difference between a synchronous and asynchronous > read. So why do we need a different IPC mechanism for the synchronous > and asynchronous cases? We don't. > > I understand that *within the backend*, you need to somehow track the > I/O, and you'll need to treat synchronous and asynchronous I/Os > differently. But that's all within the same backend, and doesn't need to > involve the flags or locks in shared memory at all. The inter-process > communication doesn't need any changes. > >>> The problem with using the Buffers I/O in progress bit is that the I/O >>> might complete while the other backend is busy doing stuff. As long as >>> you can handle the I/O completion promptly -- either in callback or >>> thread or signal handler then that wouldn't matter. But I'm not clear >>> that any of those will work reliably. >> >> They both work reliably, but the criticism was that backend B polling >> an aiocb of an aio issued by backend A is not documented as >> being supported (although it happens to work), hence the proposed >> change to use sigevent. > > You didn't understand what Greg meant. You need to handle the completion > of the I/O in the same process that initiated it, by clearing the > in-progress bit of the buffer and releasing the I/O in-progress lwlock > on it. 
And you need to do that very quickly after the I/O has finished, > because there might be another backend waiting for the buffer and you > don't want him to wait longer than necessary. I think I understand the question now. I didn't spell out the details earlier. Let me explain a little more. With this patch, when read is issued, it is either a synchronous IO (as before), or an asynchronous aio_read (new, represented by both BM_IO_IN_PROGRESS *and* BM_AIO_IN_PROGRESS). The way other backends wait on a synchronous IO in progress is unchanged. But if BM_AIO_IN_PROGRESS, then *any* backend which requests ReadBuffer on this block (including originator) follows a new path through BufCheckAsync() which, depending on various flags and context, send the backend down to FileCompleteaio to check and maybe wait. So *all* backends who are waiting for a BM_AIO_IN_PROGRESS buffer will wait in that way. > > The question is, if you receive the notification of the I/O completion > using a signal or a thread, is it safe to release the lwlock from the > signal handler or a separate thread? In the forthcoming new version of the patch that uses sigevent, the originator locks a LWlock associated with that BAaiocb eXclusive, and , when signalled, in the signal handler it places that LWlock on a process-local queue of LWlocks awaiting release. (No, It cannot be safely released inside the signal handler or in a separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro and at a few additional points in bufmgr, the backend walks this process-local queue and releases those LWlocks. This is also done if the originator itself issues a ReadBuffer, which is the most frequent case in which it is released. Meanwhile, any other backend will simply acquire Shared and release. I think you are right that the existing io_in_progress_lock LWlock in the buf_header could be used for this, because if there is a aio in progress, then that lock cannot be in use for synchronous IO. I chose not to use it because I preferred to keep the wait/post for asynch io separate, but they could both use the same LWlock. However, the way the LWlock is acquired and released would still be a bit different because of the need to have the originator release it in its mainline. > >> By the way, on the "will it actually work though?" question which several folks >> have raised, I should mention that this patch has been in semi-production >> use for almost 2 years now in different stages of completion on all postgresql >> releases from 9.1.4 to 9.5 devel. I would guess it has had around >> 500 hours of operation by now. I'm sure there are bugs still to be >> found but I am confident it is fundamentally sound. > > Well, a committable version of this patch is going to look quite > different from the first version that you posted, so I don't put much > weight on how long you've tested the first version. Yes, I am quite willing to change it, time permitting. I take the works "committable version" as a positive sign ... > > - Heikki
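A bare-bones sketch of the deferred-release scheme described above - the signal handler only records which lock is owed a release, and the mainline performs the release at a safe point - might look like the following. All names are hypothetical, this is not the patch's code, and a real implementation would also block the signal while draining the queue:

#include <signal.h>

#define MAX_PENDING 64

/* process-local queue of locks whose release this backend still owes */
static volatile sig_atomic_t npending = 0;
static void *volatile pending_locks[MAX_PENDING];

/*
 * Signal handler, installed with SA_SIGINFO.  It assumes the aio was set up
 * with sigev_notify = SIGEV_SIGNAL and sigev_value.sival_ptr pointing at the
 * lock (or control block) to release.  Nothing unsafe is done here - the
 * handler only remembers the work.
 */
static void
aio_completion_handler(int signo, siginfo_t *info, void *ucontext)
{
    if (npending < MAX_PENDING)
        pending_locks[npending++] = info->si_value.sival_ptr;
}

/*
 * Called from safe mainline points (e.g. CHECK_FOR_INTERRUPTS(), or just
 * before the originator's own ReadBuffer): perform the releases for real.
 */
static void
drain_pending_aio_releases(void)
{
    while (npending > 0)
    {
        void   *lock = pending_locks[--npending];

        /* e.g. LWLockRelease(lock) - whatever wakes the waiting backends */
        (void) lock;
    }
}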
On 06/24/2014 06:08 PM, John Lumby wrote: >> The question is, if you receive the notification of the I/O completion >> using a signal or a thread, is it safe to release the lwlock from the >> signal handler or a separate thread? > > In the forthcoming new version of the patch that uses sigevent, > the originator locks a LWlock associated with that BAaiocb eXclusive, > and , when signalled, in the signal handler it places that LWlock > on a process-local queue of LWlocks awaiting release. > (No, It cannot be safely released inside the signal handler or in a > separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro > and at a few additional points in bufmgr, the backend walks this process-local > queue and releases those LWlocks. This is also done if the originator > itself issues a ReadBuffer, which is the most frequent case in which it > is released. > > Meanwhile, any other backend will simply acquire Shared and release. Ok, doing the work in CHECK_FOR_INTERRUPTS sounds safe. But is that fast enough? We have never made any hard guarantees on how often CHECK_FOR_INTERRUPTS() is called. In particular, if you're running 3rd party C code or PL code, there might be no CHECK_FOR_INTERRUPTS() calls for many seconds, or even more. That's a long time to hold onto a buffer I/O lock. I don't think that's acceptable :-(. > I think you are right that the existing io_in_progress_lock LWlock in the > buf_header could be used for this, because if there is a aio in progress, > then that lock cannot be in use for synchronous IO. I chose not to use it > because I preferred to keep the wait/post for asynch io separate, > but they could both use the same LWlock. However, the way the LWlock > is acquired and released would still be a bit different because of the need > to have the originator release it in its mainline. It would be nice to use the same LWLock. However, if releasing a regular LWLock in a signal handler is not safe, and cannot be made safe, perhaps we should, after all, invent a whole new mechanism. One that would make it safe to release the lock in a signal handler. >>> By the way, on the "will it actually work though?" question which several folks >>> have raised, I should mention that this patch has been in semi-production >>> use for almost 2 years now in different stages of completion on all postgresql >>> releases from 9.1.4 to 9.5 devel. I would guess it has had around >>> 500 hours of operation by now. I'm sure there are bugs still to be >>> found but I am confident it is fundamentally sound. >> >> Well, a committable version of this patch is going to look quite >> different from the first version that you posted, so I don't put much >> weight on how long you've tested the first version. > > Yes, I am quite willing to change it, time permitting. > I take the works "committable version" as a positive sign ... BTW, sorry if I sound negative, I'm actually quite excited about this feature. A patch like this take a lot of work, and usually several rewrites, until it's ready ;-). But I'm looking forward for it. - Heikki
On Tue, Jun 24, 2014 at 12:08 PM, John Lumby <johnlumby@hotmail.com> wrote: >> The question is, if you receive the notification of the I/O completion >> using a signal or a thread, is it safe to release the lwlock from the >> signal handler or a separate thread? > > In the forthcoming new version of the patch that uses sigevent, > the originator locks a LWlock associated with that BAaiocb eXclusive, > and , when signalled, in the signal handler it places that LWlock > on a process-local queue of LWlocks awaiting release. > (No, It cannot be safely released inside the signal handler or in a > separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro > and at a few additional points in bufmgr, the backend walks this process-local > queue and releases those LWlocks. This is also done if the originator > itself issues a ReadBuffer, which is the most frequent case in which it > is released. I suggest using a semaphore instead. Semaphores are supposed to be incremented/decremented from multiple threads or processes already. So, in theory, the callback itself should be able to do it. The problem with the process-local queue is that it may take time to be processed (the time it takes to get to a CHECK_INTERRUPTS macro which, as happened with regexes, can be quite high).
On 06/25/2014 09:20 PM, Claudio Freire wrote: > On Tue, Jun 24, 2014 at 12:08 PM, John Lumby <johnlumby@hotmail.com> wrote: >>> The question is, if you receive the notification of the I/O completion >>> using a signal or a thread, is it safe to release the lwlock from the >>> signal handler or a separate thread? >> >> In the forthcoming new version of the patch that uses sigevent, >> the originator locks a LWlock associated with that BAaiocb eXclusive, >> and , when signalled, in the signal handler it places that LWlock >> on a process-local queue of LWlocks awaiting release. >> (No, It cannot be safely released inside the signal handler or in a >> separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro >> and at a few additional points in bufmgr, the backend walks this process-local >> queue and releases those LWlocks. This is also done if the originator >> itself issues a ReadBuffer, which is the most frequent case in which it >> is released. > > I suggest using a semaphore instead. > > Semaphores are supposed to be incremented/decremented from multiple > threads or processes already. So, in theory, the callback itself > should be able to do it. LWLocks are implemented with semaphores, so if you can increment a semaphore in the signal handler / callback thread, then in theory you should be able to release a LWLock. You'll need some additional synchronization within the same process, to make sure you don't release a lock in the signal handler while you're just doing the same thing in the main thread. I'm not sure I want to buy into the notion that LWLockRelease must be generally safe to call from a signal handler, but maybe it's possible to have a variant of it that is. On Linux at least we use System V semaphores, which are (unsurprisingly) not listed as safe for use in a signal handler; see the list of async-signal-safe functions in the signal(7) man page. The function used to increment a POSIX semaphore, sem_post(), is in the list, however. - Heikki
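For concreteness, the async-signal-safe sem_post() route that Claudio and Heikki mention can be wired directly into the aio completion notification. A minimal sketch, not patch code; the AioWait struct and helper names are invented, and for the cross-process case the struct would have to live in shared memory:

#include <aio.h>
#include <semaphore.h>
#include <string.h>
#include <sys/types.h>

typedef struct AioWait
{
    struct aiocb acb;
    sem_t        done;          /* posted exactly once when the read completes */
} AioWait;

/* SIGEV_THREAD callback: runs on a helper thread, where sem_post() is legal. */
static void
aio_done_notify(union sigval sv)
{
    AioWait    *w = (AioWait *) sv.sival_ptr;

    (void) sem_post(&w->done);
}

static int
start_read(AioWait *w, int fd, void *buf, size_t len, off_t off)
{
    memset(&w->acb, 0, sizeof(w->acb));
    w->acb.aio_fildes = fd;
    w->acb.aio_buf = buf;
    w->acb.aio_nbytes = len;
    w->acb.aio_offset = off;

    /* pshared = 1: usable by another process if 'w' lives in shared memory */
    (void) sem_init(&w->done, 1, 0);

    w->acb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    w->acb.aio_sigevent.sigev_notify_function = aio_done_notify;
    w->acb.aio_sigevent.sigev_value.sival_ptr = w;

    return aio_read(&w->acb);
}

/* Any backend that needs the block before it is ready just waits here. */
static void
wait_read(AioWait *w)
{
    (void) sem_wait(&w->done);
}

Note the callback still runs inside the originating process; it is only the sem_post() on a process-shared semaphore that wakes a waiter in another backend.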
On 2014-06-26 00:08:48 +0300, Heikki Linnakangas wrote: > LWLocks are implemented with semaphores, so if you can increment a semaphore > in the signal handler / callback thread, then in theory you should be able > to release a LWLock. I don't think that's a convincing argument even if semop et al were signal safe. LWLocks also use spinlocks and it's a fairly bad idea to do anything with the same spinlock both inside and outside a signal handler. I don't think we're going to get lwlock.c/LWLockRelease to work reasonably from a signal handler without a lot of work. > On Linux at least we use System V semaphores, which are (unsurpisingly) not > listed as safe for using in a signal handler. See list Async-signal-safe > functions in signal(7) man page. The function used to increment a POSIX > semaphore, sem_post(), is in the list, however. Heh, just wrote the same after reading the beginning of your email ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
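To illustrate Andres' point about taking the same spinlock both inside and outside a signal handler: if the interrupted mainline already holds the lock, a handler that tries to take it spins forever, because the holder cannot resume until the handler returns. A toy illustration only, using gcc builtins rather than PostgreSQL's s_lock code:

/* Toy illustration of the self-deadlock hazard; not PostgreSQL's s_lock code. */
static volatile int lock = 0;          /* stand-in for a spinlock */

static void
acquire(volatile int *l)
{
    while (__sync_lock_test_and_set(l, 1))
        ;                               /* spin until the holder releases */
}

static void
release(volatile int *l)
{
    __sync_lock_release(l);
}

static void
handler(int signo)
{
    /*
     * If the interrupted code already holds 'lock', we spin here forever:
     * the holder cannot run again until this handler returns, and this
     * handler never returns because it is waiting for the holder.
     */
    acquire(&lock);
    /* ... manipulate the shared structure ... */
    release(&lock);
}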
I am attaching the new version of the patch with support for use of sigevent to report completion of asynch io and the new LWlock for waiters to wait for originator to release it. This patch is based on up-to-date git head but the asyncio design is a bit behind the curve of recent discussions here. Specifically, sigevent is still optional (see README for how to configure it) and non-originators still refer to the originator's aiocb to check the completion code (but not for waiting for it to complete). All sigevent-related code is marked by #ifdef USE_AIO_SIGEVENT to make it easier to find. Also although it "works" it is not functionally complete - search for FIXME in src/backend/storage/buffer/buf_async.c and has not had nearly as much testing. ________________________________________________________________________ I will try to paste earlier points and imbed comments : Heikki wrote : > Ok, doing the work in CHECK_FOR_INTERRUPTS sounds safe. But is that fast > enough? We have never made any hard guarantees on how often > CHECK_FOR_INTERRUPTS() is called. In particular, if you're running 3rd > party C code or PL code, there might be no CHECK_FOR_INTERRUPTS() calls > for many seconds, or even more. That's a long time to hold onto a buffer > I/O lock. I don't think that's acceptable :-(. true but remember this case is the very infrequent one, less than 0.2 % (in my benchmark). > >> I think you are right that the existing io_in_progress_lock LWlock in the >> buf_header could be used for this, because if there is a aio in progress, >> then that lock cannot be in use for synchronous IO. I chose not to use it >> because I preferred to keep the wait/post for asynch io separate, >> but they could both use the same LWlock. However, the way the LWlock >> is acquired and released would still be a bit different because of the need >> to have the originator release it in its mainline. > > It would be nice to use the same LWLock. I think that will work. > > However, if releasing a regular LWLock in a signal handler is not safe, > and cannot be made safe, perhaps we should, after all, invent a whole > new mechanism. One that would make it safe to release the lock in a > signal handler. It would take rewriting lwlock.c and still be devillish hard to test - I would prefer not. > >> >>> Well, a committable version of this patch is going to look quite >>> different from the first version that you posted, so I don't put much >>> weight on how long you've tested the first version. >> >> Yes, I am quite willing to change it, time permitting. >> I take the works "committable version" as a positive sign ... > > BTW, sorry if I sound negative, I'm actually quite excited about this > feature. A patch like this take a lot of work, and usually several > rewrites, until it's ready ;-). But I'm looking forward for it. I am grateful you spent the time to study it. knowing the code, I am not surprised it made you a bit grumpy ... ________________________________________________________ discussion about what is the difference between a synchronous read versus an asynchronous read as far as non-originator waiting on it is concerned. I thought a bit more about this. There are currently two differences, one of which can easily be changed and one not so easy. 1) The current code, even with sigevent, still makes the non-originator waiter call aio_error on the originator's aiocb to get the completion code. 
For sigevent variation, easily changed to have the originator always call aio_error (from its CHECK_INTERRUPTS or from FIleCompleteaio) and store that in the BAiocb. My idea of why not to do that was that, by having the non-originator check the aiocb, this would allow the waiter to proceed sooner. But for a different reason it actually doesn't. (The non-originator must still wait for the LWlock release) 2) Buffer pinning and returning the BufferAiocb to the free list With synchronous IO, each backend that calls a ReadBuffer must pin the buffer early in the process. With asynchronous IO, initially only the originator gets the pin (and that is during PrefetchBuffer, not Readbuffer) When the aio completes and some backend checks that completion, then the backend has various responsibilities: . pin the buffer if it did not already have one (from prefetch) . if it was the last such backend to make that check (amongst the cohort waiting on it) then pin the buffer if it did not already have one (from prefetch) Those functions are different from the synchronous IO case. I think code clarity alone may dictate keeping these separate. ___________________________________________________________________ various discussion of semaphores and LWlocks: ---------------------------------------- > Date: Wed, 25 Jun 2014 23:17:53 +0200 > From: andres@2ndquadrant.com > To: hlinnakangas@vmware.com > CC: klaussfreire@gmail.com; johnlumby@hotmail.com; stark@mit.edu; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch > > On 2014-06-26 00:08:48 +0300, Heikki Linnakangas wrote: >> LWLocks are implemented with semaphores, so if you can increment a semaphore >> in the signal handler / callback thread, then in theory you should be able >> to release a LWLock. > > I don't think that's a convincing argument even if semop et al were > signal safe. LWLocks also use spinlocks and it's a fairly bad idea to do > anything with the same spinlock both inside and outside a signal > handler. > I don't think we're going to get lwlock.c/LWLockRelease to work > reasonably from a signal handler without a lot of work. I agree - see earlier.
Attachment
My cut'n'pasting failed me at one point corrected below. > discussion about what is the difference between a synchronous read > versus an asynchronous read as far as non-originator waiting on it is concerned. > > I thought a bit more about this. There are currently two differences, > one of which can easily be changed and one not so easy. > > 1) The current code, even with sigevent, still makes the non-originator waiter > call aio_error on the originator's aiocb to get the completion code. > For sigevent variation, easily changed to have the originator always call aio_error > (from its CHECK_INTERRUPTS or from FIleCompleteaio) > and store that in the BAiocb. > My idea of why not to do that was that, by having the non-originator check the aiocb, > this would allow the waiter to proceed sooner. But for a different reason it actually > doesn't. (The non-originator must still wait for the LWlock release) > > 2) Buffer pinning and returning the BufferAiocb to the free list > With synchronous IO, each backend that calls a ReadBuffer must pin the buffer > early in the process. > With asynchronous IO, initially only the originator gets the pin > (and that is during PrefetchBuffer, not Readbuffer) > When the aio completes and some backend checks that completion, > then the backend has various responsibilities: > > . pin the buffer if it did not already have one (from prefetch) > . if it was the last such backend to make that check > (amongst the cohort waiting on it) > then XXXXXXpin the buffer if it did not already have one (from prefetch)XXXX then return the BufferAiocb to the free list
I am attaching a new version of the patch for consideration in the current commit fest. Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190@phx.gbl the method for handling aio completion using sigevent has been re-written to use signals exclusively rather than a composite of signals and LWlocks, and this has fixed the problem I mentioned before with the LWlock method. More details are in postgresql-prefetching-asyncio.README (search for "Checking AIO Completion" ) I also have worked my benchmark database and one application into a publishable state. However, the database is in the form of a compressed pg_dump and is around 218 MB in size. I was hoping to come up with a generation program to load it but have not had the time for that. Is there some place on postgresql.org for such a large file? If not I will try to think of some place for it. John Lumby
Attachment
On 08/20/2014 12:17 AM, John Lumby wrote: > I am attaching a new version of the patch for consideration in the current commit fest. Thanks for working on this! > Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190@phx.gbl > the method for handling aio completion using sigevent has been re-written to use > signals exclusively rather than a composite of signals and LWlocks, > and this has fixed the problem I mentioned before with the LWlock method. ISTM the patch is still allocating stuff in shared memory that really doesn't belong there. Namely, the BufferAiocb structs. Or at least parts of it; there's also a waiter queue there which probably needs to live in shared memory, but the rest of it does not. At least BufAWaitAioCompletion is still calling aio_error() on an AIO request that might've been initiated by another backend. That's not OK. Please write the patch without atomic CAS operation. Just use a spinlock. There's a patch in the commitfest to add support for that, but it's not committed yet, and all those USE_AIO_ATOMIC_BUILTIN_COMP_SWAP ifdefs make the patch more difficult to read. Same with all the other #ifdefs; please just pick a method that works. Also, please split prefetching of regular index scans into a separate patch. It's orthogonal to doing async I/O; we could prefetch regular index scans with posix_fadvise too, and AIO should be useful for prefetching other stuff. - Heikki
On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Also, please split prefetching of regular index scans into a separate patch. > It's orthogonal to doing async I/O; we could prefetch regular index scans > with posix_fadvise too, and AIO should be useful for prefetching other > stuff. That patch already happened on the list, and it wasn't a win in many cases. I'm not sure it should be proposed independently of this one. Maybe a separate patch, but it should be considered dependent on this. I don't have an archive link at hand atm, but I could produce one later.
Thanks for the replies and thoughts.

On 08/19/14 18:27, Heikki Linnakangas wrote:
> On 08/20/2014 12:17 AM, John Lumby wrote:
>> I am attaching a new version of the patch for consideration in the current commit fest.
>
> Thanks for working on this!
>
>> Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190@phx.gbl
>> the method for handling aio completion using sigevent has been re-written to use
>> signals exclusively rather than a composite of signals and LWlocks,
>> and this has fixed the problem I mentioned before with the LWlock method.
>
> ISTM the patch is still allocating stuff in shared memory that really doesn't belong there. Namely, the BufferAiocb structs. Or at least parts of it; there's also a waiter queue there which probably needs to live in shared memory, but the rest of it does not.

Actually the reason the BufferAiocb (the postgresql block corresponding to the aio's aiocb) must be located in shared memory is that, as you said, it acts as anchor for the waiter list. See further comment below.

> At least BufAWaitAioCompletion is still calling aio_error() on an AIO request that might've been initiated by another backend. That's not OK.

Yes, you are right, and I agree with this one - I will add an aio_error_return_code field in the BufferAiocb and only the originator will set this from the real aiocb.

> Please write the patch without atomic CAS operation. Just use a spinlock.

Umm, this is a new criticism I think. I use CAS for things other than locking, such as add/remove from shared queue. I suppose maybe a spinlock on the entire queue can be used equivalently, but with more code (extra confusion) and worse performance (coarser serialization). What is your objection to using gcc's atomic ops? Portability?

> There's a patch in the commitfest to add support for that,

Sorry, support for what? There are already spinlocks in postgresql; you mean some new kind? Please point me at it with a hackers msgid or something.

> but it's not committed yet, and all those USE_AIO_ATOMIC_BUILTIN_COMP_SWAP ifdefs make the patch more difficult to read. Same with all the other #ifdefs; please just pick a method that works.

Ok, yes, the ifdefs are unpleasant. I will do something about that. Ideally they would be entirely confined to header files and only macro functions in C files - maybe I can do that. And eventually, when the dust has settled, eliminate obsolete ifdef blocks altogether.

> Also, please split prefetching of regular index scans into a separate patch. It's orthogonal to doing async I/O;

Actually not completely orthogonal, see next.

> we could prefetch regular index scans with posix_fadvise too, and AIO should be useful for prefetching other stuff.

On 08/19/14 19:10, Claudio Freire wrote:
> On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> Also, please split prefetching of regular index scans into a separate patch. ...
> That patch already happened on the list, and it wasn't a win in many
> cases. I'm not sure it should be proposed independently of this one.
> Maybe a separate patch, but it should be considered dependent on this.
>
> I don't have an archive link at hand atm, but I could produce one later.

Several people have asked to split this patch into several smaller ones and I have thought about it. It would introduce some awkward dependencies. E.g. splitting the scanner code (index, relation-heap) into separate patch from aio code would mean that the scanner patch becomes dependent on the aio patch. They are not quite orthogonal.

The reason is that the scanners call a new function, DiscardBuffer(blockid) when aio is in use. We can get around it by providing a stub for that function in the scanner patch, but then there is some risk of someone getting the wrong version of that function in their build. It just adds yet more complexity and breakage opportunities.

> - Heikki

One further comment concerning these BufferAiocb and aiocb control blocks being in shared memory:

I explained above that the BufferAiocb must be in shared memory for wait/post. The aiocb does not necessarily have to be in shared memory, but since there is a one-to-one relationship between BufferAiocb and aiocb, it makes the code much simpler, and the operation much more efficient, if the aiocb is embedded in the BufferAiocb as I have it now. E.g. if the aiocb is in private-process memory, then an additional allocation scheme is needed (fixed number? palloc()in'g extra ones as needed? ...) which adds complexity, and probably some inefficiency since a shared pool is usually more efficient (allows higher maximum per process etc), and there would have to be some pointer de-referencing from BufferAiocb to aiocb adding some (small) overhead.

I understood your objection to use of shared memory as being that you don't want a non-originator to access the originator's aiocb using aio_xxx calls, and, with the one change I've said I will do above (put the aio_error retcode in the BufferAiocb), I will have achieved that requirement. I am hoping this answers your objections concerning shared memory.
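For what the review is asking for, a plain spinlock-protected free list of these control blocks could look roughly like the sketch below. The MyAiocbSlot/MyAiocbPool names are invented for illustration, the pool's slock_t would be initialized with SpinLockInit() when the shared segment is created, and this is not code from the patch:

#include "postgres.h"
#include "storage/spin.h"

/* Invented names for illustration; not structures from the patch. */
typedef struct MyAiocbSlot
{
    struct MyAiocbSlot *next;
    /* ... the aiocb, flags, waiter list, etc. would go here ... */
} MyAiocbSlot;

typedef struct MyAiocbPool
{
    slock_t      mutex;         /* initialized once with SpinLockInit() */
    MyAiocbSlot *free_head;
} MyAiocbPool;

/* Take a free control block, or NULL if none are available. */
static MyAiocbSlot *
pool_pop(MyAiocbPool *pool)
{
    MyAiocbSlot *slot;

    SpinLockAcquire(&pool->mutex);
    slot = pool->free_head;
    if (slot != NULL)
        pool->free_head = slot->next;
    SpinLockRelease(&pool->mutex);

    return slot;
}

/* Return a control block to the shared free list. */
static void
pool_push(MyAiocbPool *pool, MyAiocbSlot *slot)
{
    SpinLockAcquire(&pool->mutex);
    slot->next = pool->free_head;
    pool->free_head = slot;
    SpinLockRelease(&pool->mutex);
}

A lock-free CAS push/pop of the same list is possible too, but a naive version has to contend with the ABA problem, which is part of why a simple spinlock is attractive for a first version.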
On 08/25/2014 12:49 AM, johnlumby wrote:
> On 08/19/14 18:27, Heikki Linnakangas wrote:
>> Please write the patch without atomic CAS operation. Just use a spinlock.
>
> Umm, this is a new criticism I think.

Yeah. Be prepared that new issues will crop up as the patch gets slimmer
and easier to review :-). Right now there's still so much chaff that it's
difficult to see the wheat.

> I use CAS for things other than locking,
> such as add/remove from shared queue.  I suppose maybe a spinlock on
> the entire queue can be used equivalently, but with more code (extra
> confusion) and worse performance (coarser serialization).  What is your
> objection to using gcc's atomic ops?  Portability?

Yeah, portability. Atomic ops might make sense, but at this stage it's
important to slim down the patch to the bare minimum, so that it's easier
to review. Once the initial patch is in, you can improve it with
additional patches.

>> There's a patch in the commitfest to add support for that,
>
> sorry, support for what?  There are already spinlocks in postgresql,
> you mean some new kind?  please point me at it with hacker msgid or
> something.

Atomic ops: https://commitfest.postgresql.org/action/patch_view?id=1314

Once that's committed, you can use the new atomic ops in your patch. But
until then, stick to spinlocks.

> On 08/19/14 19:10, Claudio Freire wrote:
>> On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> Also, please split prefetching of regular index scans into a separate patch. ...
>> That patch already happened on the list, and it wasn't a win in many
>> cases. I'm not sure it should be proposed independently of this one.
>> Maybe a separate patch, but it should be considered dependent on this.
>>
>> I don't have an archive link at hand atm, but I could produce one later.
> Several people have asked to split this patch into several smaller ones
> and I have thought about it.  It would introduce some awkward dependencies.
> E.g. splitting the scanner code (index, relation-heap) into a separate patch
> from the aio code would mean that the scanner patch becomes dependent
> on the aio patch.  They are not quite orthogonal.

Right now, please focus on the main AIO patch. That should show a benefit
for bitmap heap scans too, so to demonstrate the benefits of AIO, you
don't need to prefetch regular index scans. The main AIO patch can be
written, performance tested, and reviewed without caring about regular
index scans at all.

> The reason is that the scanners call a new function, DiscardBuffer(blockid),
> when aio is in use.  We can get around it by providing a stub for that
> function in the scanner patch, but then there is some risk of someone
> getting the wrong version of that function in their build.  It just adds
> yet more complexity and breakage opportunities.

Regardless of the regular index scans, we'll need to discuss the new API
of PrefetchBuffer and DiscardBuffer. It would be simpler for the callers
if you didn't require the DiscardBuffer calls. I think it would be totally
feasible to write the patch that way. Just drop the buffer pin after the
asynchronous read finishes. When the caller actually needs the page, it
will call ReadBuffer which will pin it again. You'll get a little bit more
bufmgr traffic that way, but I think it'll be fine in practice.
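[To make the suggested simplification concrete, a sketch (not from the patch) of what the caller side would look like without DiscardBuffer: the prefetch path holds a pin only while the asynchronous read is in flight and drops it on completion, so the scan never owes the buffer manager anything for blocks it skips.  PrefetchBuffer, ReadBuffer and ReleaseBuffer are the existing bufmgr functions; the behaviour attributed to PrefetchBuffer here is the proposed one, not what it does today.]

static void
example_scan_step(Relation rel, BlockNumber blkno, BlockNumber next_blkno)
{
    Buffer      buf;

    /* hint: may start an async read; the pin taken for the aio is dropped
     * again as soon as the read completes, so nothing is leaked if the
     * scan never comes back for next_blkno */
    PrefetchBuffer(rel, MAIN_FORKNUM, next_blkno);

    /* the block we need right now: ReadBuffer pins it (again), whether or
     * not an earlier async read already completed -- the extra pin/unpin
     * is the "little bit more bufmgr traffic" mentioned above */
    buf = ReadBuffer(rel, blkno);
    /* ... process the page ... */
    ReleaseBuffer(buf);
}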
> One further comment concerning these BufferAiocb and aiocb control blocks
> being in shared memory:
>
> I explained above that the BufferAiocb must be in shared memory for
> wait/post.
> The aiocb does not necessarily have to be in shared memory,
> but since there is a one-to-one relationship between BufferAiocb and aiocb,
> it makes the code much simpler, and the operation much more efficient,
> if the aiocb is embedded in the BufferAiocb as I have it now.
> E.g. if the aiocb is in private-process memory, then an additional
> allocation scheme is needed (fixed number? palloc()in'g extra ones as
> needed? ...) which adds complexity,

Yep, use palloc or a fixed pool. There's nothing wrong with that.

> and probably some inefficiency since a shared pool is usually
> more efficient (allows a higher maximum per process etc), and there
> would have to be some pointer de-referencing from BufferAiocb to aiocb,
> adding some (small) overhead.

I think you're falling into the domain of premature optimization. A few
pointer dereferences are totally insignificant, and the amount of memory
you're saving pales in comparison to other similar non-shared pools and
caches we have (catalog caches, for example). And on the other side of
the coin, with a shared pool you'll waste memory when async I/O is not
used (e.g. because everything fits in RAM), and when it is used, you'll
have more contention on locks and cache lines when multiple processes
use the same objects. The general guideline in PostgreSQL is that
everything is backend-private, except structures used to communicate
between backends.

- Heikki
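[A sketch of the backend-private fixed pool suggested above, for concreteness.  MemoryContextAllocZero and TopMemoryContext are the real PostgreSQL facilities; the pool itself, its size, and all names are invented for this illustration and carry no claim about how the patch should do it.]

#include <aio.h>

#define LOCAL_AIOCB_POOL_SIZE 32

static struct aiocb *local_aiocb_pool = NULL;
static bool          local_aiocb_in_use[LOCAL_AIOCB_POOL_SIZE];

static struct aiocb *
get_local_aiocb(void)
{
    int         i;

    /* allocate the pool lazily, in a long-lived backend-private context */
    if (local_aiocb_pool == NULL)
        local_aiocb_pool = (struct aiocb *)
            MemoryContextAllocZero(TopMemoryContext,
                                   LOCAL_AIOCB_POOL_SIZE * sizeof(struct aiocb));

    for (i = 0; i < LOCAL_AIOCB_POOL_SIZE; i++)
    {
        if (!local_aiocb_in_use[i])
        {
            local_aiocb_in_use[i] = true;
            return &local_aiocb_pool[i];
        }
    }

    /* pool exhausted: caller falls back to a synchronous read */
    return NULL;
}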