Thread: Re: Extended Prefetching using Asynchronous IO - proposal and patch

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:

The patch is attached.
It is based on a clone of today's 9.4dev source.
I have noticed that this source is
(not surprisingly) quite a moving target at present,
meaning that this patch becomes stale quite quickly.
So although this copy is fine for reviewing,
it may well soon no longer apply
to the current source tree.

As mentioned before, if anyone wishes to try this feature out
on 9.3.4, I will be making a patch for that soon,
which I can supply on request.


John Lumby
Attachment

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 05/28/2014 11:52 PM, John Lumby wrote:
>
> The patch is attached.
> It is based on a clone of today's 9.4dev source.
> I have noticed that this source is
> (not surprisingly) quite a moving target at present,
> meaning that this patch becomes stale quite quickly.
> So although this copy is fine for reviewing,
> it may well soon no longer apply
> to the current source tree.
>
> As mentioned before, if anyone wishes to try this feature out
> on 9.3.4, I will be making a patch for that soon,
> which I can supply on request.

Wow, that's a huge patch. I took a very brief look, focusing on the
basic design, ignoring the style & other minor things for now:

The patch seems to assume that you can put the aiocb struct in shared 
memory, initiate an asynchronous I/O request from one process, and wait 
for its completion from another process. I'm pretty surprised if that 
works on any platform.

How portable is POSIX aio nowadays? Googling around, it still seems that 
on Linux, it's implemented using threads. Does the thread-emulation 
implementation cause problems with the rest of the backend, which 
assumes that there is only a single thread? In any case, I think we'll 
want to encapsulate the AIO implementation behind some kind of an API, 
to allow other implementations to co-exist.
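For illustration, a hypothetical shape such an API might take (none of these
names exist today; a glibc-aio, kernel-native, or worker-thread implementation
would each fill in the table):

/* Sketch only: hypothetical vtable for pluggable async-I/O implementations */
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

typedef struct PgAioHandle PgAioHandle;     /* opaque per-request state */

typedef struct PgAioImpl
{
    const char *name;                       /* "posix_aio", "worker", ... */

    /* start an asynchronous read of nbytes at offset into buf */
    int     (*start_read) (PgAioHandle *h, int fd, void *buf,
                           size_t nbytes, off_t offset);

    /* has the request finished?  must never block */
    bool    (*is_done) (PgAioHandle *h);

    /* block until finished; only legal in the initiating process */
    int     (*wait) (PgAioHandle *h);

    /* collect the result (bytes read, or -errno); exactly once */
    ssize_t (*get_result) (PgAioHandle *h);

    /* abandon a request, e.g. when a scan stops early */
    void    (*cancel) (PgAioHandle *h);
} PgAioImpl;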

Benchmarks?
- Heikki



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:
Thanks for looking at it!

> Date: Thu, 29 May 2014 00:19:33 +0300
> From: hlinnakangas@vmware.com
> To: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
> CC: klaussfreire@gmail.com
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>
> On 05/28/2014 11:52 PM, John Lumby wrote:
> >
>
> The patch seems to assume that you can put the aiocb struct in shared
> memory, initiate an asynchronous I/O request from one process, and wait
> for its completion from another process. I'm pretty surprised if that
> works on any platform.

It works on Linux. Actually, this ability allows the asyncio implementation to
reduce complexity in one respect (yes, I know it looks complex enough):
it makes waiting for completion of an in-progress IO simpler than in
the existing synchronous IO case, since librt takes care of the waiting.
Specifically, there is no need for extra wait-for-IO control blocks
such as in bufmgr's WaitIO().

This also plays very nicely with syncscan, where the cohorts run close together.

If anyone can advise whether this also works on MacOS/BSD or Windows,
that would be good, as I can't verify it myself.

>
> How portable is POSIX aio nowadays? Googling around, it still seems that
> on Linux, it's implemented using threads. Does the thread-emulation
> implementation cause problems with the rest of the backend, which
> assumes that there is only a single thread? In any case, I think we'll

Good question; I am not aware of any dependency on a backend having only
a single thread. Can you please point me to such dependencies?


> want to encapsulate the AIO implementation behind some kind of an API,
> to allow other implementations to co-exist.

It already is - it follows the smgr (storage manager) -> md (magnetic disk) -> fd (file descriptor)
layering used for the existing filesystem ops, including posix_fadvise.
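For orientation, the existing prefetch path descends smgrprefetch() ->
mdprefetch() -> FilePrefetch() (which wraps posix_fadvise), so the aio entry
points presumably mirror it at each layer. A sketch with invented declarations
(only FileCompleteaio is a name that actually comes up later in this thread):

/* Sketch only: hypothetical aio entry points mirroring the existing
 * smgrprefetch() -> mdprefetch() -> FilePrefetch() chain. */
#include "postgres.h"
#include "storage/smgr.h"   /* SMgrRelation, ForkNumber, BlockNumber */
#include "storage/fd.h"     /* File */

/* smgr layer: relation-level request, dispatched through the smgr table */
extern void smgrstartaio(SMgrRelation reln, ForkNumber forknum,
                         BlockNumber blocknum, char *buffer);

/* md layer: maps (fork, block) to a segment file and a file offset */
extern void mdstartaio(SMgrRelation reln, ForkNumber forknum,
                       BlockNumber blocknum, char *buffer);

/* fd layer: issues the aio_read on the virtual file descriptor, and
 * later waits for / collects it */
extern int FileStartaio(File file, char *buffer, int amount, off_t offset);
extern int FileCompleteaio(File file);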

>
> Benchmarks?

I will see if I can package mine up somehow.
It would be great if anyone else would like to benchmark it on theirs ...


> - Heikki

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 05/29/2014 04:12 PM, John Lumby wrote:
> Thanks for looking at it!
>
>> Date: Thu, 29 May 2014 00:19:33 +0300
>> From: hlinnakangas@vmware.com
>> To: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>> CC: klaussfreire@gmail.com
>> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>>
>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>
>>
>> The patch seems to assume that you can put the aiocb struct in shared
>> memory, initiate an asynchronous I/O request from one process, and wait
>> for its completion from another process. I'm pretty surprised if that
>> works on any platform.
>
> It works on Linux. Actually, this ability allows the asyncio implementation to
> reduce complexity in one respect (yes, I know it looks complex enough):
> it makes waiting for completion of an in-progress IO simpler than in
> the existing synchronous IO case, since librt takes care of the waiting.
> Specifically, there is no need for extra wait-for-IO control blocks
> such as in bufmgr's WaitIO().

[checks]. No, it doesn't work. See attached test program.

It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and
when the I/O is finished, the worker thread fills them in. aio_error()
and aio_return() just return the values of those fields, so calling
aio_error() or aio_return() do in fact happen to work from a different
process. aio_suspend(), however, is implemented by sleeping on a
process-local mutex, which does not work from a different process.

Even if it worked on Linux today, it would be a bad idea to rely on it
from a portability point of view. No, the only sane way to make this
work is that the process that initiates an I/O request is responsible
for completing it. If another process needs to wait for an async I/O to
complete, we must use some other means to do the waiting. Like the
io_in_progress_lock that we already have, for the same purpose.
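(The test program itself is in the attachment and not reproduced in the
archive; a guessed reconstruction of the scenario it exercises is sketched
below: the parent issues aio_read on an aiocb in shared memory, and the child
then tries to aio_suspend on it.)

/* Sketch only, not the attached program: with glibc's thread-based aio,
 * the child never gets the wakeup and blocks until the timeout, even
 * though aio_error() on the shared aiocb would show completion.
 * Build with: cc test.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
    struct aiocb *cb = mmap(NULL, sizeof(*cb), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    static char buf[512];
    int     fd = open("/etc/hosts", O_RDONLY);

    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = sizeof(buf);

    if (fork() == 0)
    {
        /* child: wait for the parent's I/O via the shared aiocb */
        const struct aiocb *list[1] = { cb };
        struct timespec timeout = { 5, 0 };

        sleep(1);               /* let the parent issue the read */
        if (aio_suspend(list, 1, &timeout) != 0)
            printf("child: aio_suspend gave errno %d (EAGAIN is %d), "
                   "yet aio_error now says %d\n",
                   errno, EAGAIN, aio_error(cb));
        else
            printf("child: woken; aio_error = %d\n", aio_error(cb));
        _exit(0);
    }
    if (aio_read(cb) != 0)
        return 1;
    wait(NULL);
    return 0;
}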

- Heikki


Attachment

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 05/29/2014 04:12 PM, John Lumby wrote:
>>
>> Thanks for looking at it!
>>
>>> Date: Thu, 29 May 2014 00:19:33 +0300
>>> From: hlinnakangas@vmware.com
>>> To: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>>> CC: klaussfreire@gmail.com
>>> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO -
>>> proposal and patch
>>>
>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>
>>>>
>>>
>>> The patch seems to assume that you can put the aiocb struct in shared
>>> memory, initiate an asynchronous I/O request from one process, and wait
>>> for its completion from another process. I'm pretty surprised if that
>>> works on any platform.
>>
>>
>> It works on Linux. Actually, this ability allows the asyncio
>> implementation to
>> reduce complexity in one respect (yes, I know it looks complex enough):
>> it makes waiting for completion of an in-progress IO simpler than in
>> the existing synchronous IO case, since librt takes care of the
>> waiting.
>> Specifically, there is no need for extra wait-for-IO control blocks
>> such as in bufmgr's WaitIO().
>
>
> [checks]. No, it doesn't work. See attached test program.
>
> It kinda seems to work sometimes, because of the way it's implemented in
> glibc. The aiocb struct has a field for the result value and errno, and when
> the I/O is finished, the worker thread fills them in. aio_error() and
> aio_return() just return the values of those fields, so calling aio_error()
> or aio_return() do in fact happen to work from a different process.
> aio_suspend(), however, is implemented by sleeping on a process-local mutex,
> which does not work from a different process.
>
> Even if it worked on Linux today, it would be a bad idea to rely on it from
> a portability point of view. No, the only sane way to make this work is that
> the process that initiates an I/O request is responsible for completing it.
> If another process needs to wait for an async I/O to complete, we must use
> some other means to do the waiting. Like the io_in_progress_lock that we
> already have, for the same purpose.

But calls to it are timed out after 10us, effectively turning the thing
into polling mode.

What seems odd, now that I read it carefully, is how 256 waits of 10us
amount to anything: that's just ~2.5ms, not enough, IIUC, for any normal
I/O to complete.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 05/29/2014 11:34 PM, Claudio Freire wrote:
> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>
>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>
>>>> The patch seems to assume that you can put the aiocb struct in shared
>>>> memory, initiate an asynchronous I/O request from one process, and wait
>>>> for its completion from another process. I'm pretty surprised if that
>>>> works on any platform.
>>>
>>> It works on Linux. Actually, this ability allows the asyncio
>>> implementation to
>>> reduce complexity in one respect (yes, I know it looks complex enough):
>>> it makes waiting for completion of an in-progress IO simpler than in
>>> the existing synchronous IO case, since librt takes care of the
>>> waiting.
>>> Specifically, there is no need for extra wait-for-IO control blocks
>>> such as in bufmgr's WaitIO().
>>
>> [checks]. No, it doesn't work. See attached test program.
>>
>> It kinda seems to work sometimes, because of the way it's implemented in
>> glibc. The aiocb struct has a field for the result value and errno, and when
>> the I/O is finished, the worker thread fills them in. aio_error() and
>> aio_return() just return the values of those fields, so calling aio_error()
>> or aio_return() do in fact happen to work from a different process.
>> aio_suspend(), however, is implemented by sleeping on a process-local mutex,
>> which does not work from a different process.
>>
>> Even if it worked on Linux today, it would be a bad idea to rely on it from
>> a portability point of view. No, the only sane way to make this work is that
>> the process that initiates an I/O request is responsible for completing it.
>> If another process needs to wait for an async I/O to complete, we must use
>> some other means to do the waiting. Like the io_in_progress_lock that we
>> already have, for the same purpose.
>
> But calls to it are timed out after 10us, effectively turning the thing
> into polling mode.

We don't want polling... And even if we did, calling aio_suspend() in a 
way that's known to be broken, in a loop, is a pretty crappy way of polling.

> What seems odd, now that I read it carefully, is how 256 waits of 10us
> amount to anything: that's just ~2.5ms, not enough, IIUC, for any normal
> I/O to complete

Wild guess: the buffer you're reading is already in OS cache.

- Heikki



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 05/29/2014 11:34 PM, Claudio Freire wrote:
>>
>> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>>
>>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>>
>>>>
>>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>>
>>>>> The patch seems to assume that you can put the aiocb struct in shared
>>>>> memory, initiate an asynchronous I/O request from one process, and wait
>>>>> for its completion from another process. I'm pretty surprised if that
>>>>> works on any platform.
>>>>
>>>>
>>>> It works on Linux. Actually, this ability allows the asyncio
>>>> implementation to
>>>> reduce complexity in one respect (yes, I know it looks complex enough):
>>>> it makes waiting for completion of an in-progress IO simpler than in
>>>> the existing synchronous IO case, since librt takes care of the
>>>> waiting.
>>>> Specifically, there is no need for extra wait-for-IO control blocks
>>>> such as in bufmgr's WaitIO().
>>>
>>>
>>> [checks]. No, it doesn't work. See attached test program.
>>>
>>> It kinda seems to work sometimes, because of the way it's implemented in
>>> glibc. The aiocb struct has a field for the result value and errno, and
>>> when
>>> the I/O is finished, the worker thread fills them in. aio_error() and
>>> aio_return() just return the values of those fields, so calling
>>> aio_error()
>>> or aio_return() do in fact happen to work from a different process.
>>> aio_suspend(), however, is implemented by sleeping on a process-local
>>> mutex,
>>> which does not work from a different process.
>>>
>>> Even if it worked on Linux today, it would be a bad idea to rely on it
>>> from
>>> a portability point of view. No, the only sane way to make this work is
>>> that
>>> the process that initiates an I/O request is responsible for completing
>>> it.
>>> If another process needs to wait for an async I/O to complete, we must
>>> use
>>> some other means to do the waiting. Like the io_in_progress_lock that we
>>> already have, for the same purpose.
>>
>>
>> But calls to it are timed out after 10us, effectively turning the thing
>> into polling mode.
>
>
> We don't want polling... And even if we did, calling aio_suspend() in a way
> that's known to be broken, in a loop, is a pretty crappy way of polling.

Agreed. Just saying that that's why it works.

The PID of the process responsible for the aio should be there
somewhere, and other processes should explicitly synchronize (or
initiate their own I/O and let the OS merge them - which also works).

>
>
>> What seems odd, now that I read it carefully, is how 256 waits of 10us
>> amount to anything: that's just ~2.5ms, not enough, IIUC, for any normal
>> I/O to complete
>
>
> Wild guess: the buffer you're reading is already in OS cache.

My wild guess as well.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 05/29/2014 11:34 PM, Claudio Freire wrote:
>>
>> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>>
>>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>>
>>>>
>>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>>
>>>>> The patch seems to assume that you can put the aiocb struct in shared
>>>>> memory, initiate an asynchronous I/O request from one process, and wait
>>>>> for its completion from another process. I'm pretty surprised if that
>>>>> works on any platform.
>>>>
>>>>
>>>> It works on Linux. Actually, this ability allows the asyncio
>>>> implementation to
>>>> reduce complexity in one respect (yes, I know it looks complex enough):
>>>> it makes waiting for completion of an in-progress IO simpler than in
>>>> the existing synchronous IO case, since librt takes care of the
>>>> waiting.
>>>> Specifically, there is no need for extra wait-for-IO control blocks
>>>> such as in bufmgr's WaitIO().
>>>
>>>
>>> [checks]. No, it doesn't work. See attached test program.
>>>
>>> It kinda seems to work sometimes, because of the way it's implemented in
>>> glibc. The aiocb struct has a field for the result value and errno, and
>>> when
>>> the I/O is finished, the worker thread fills them in. aio_error() and
>>> aio_return() just return the values of those fields, so calling
>>> aio_error()
>>> or aio_return() do in fact happen to work from a different process.
>>> aio_suspend(), however, is implemented by sleeping on a process-local
>>> mutex,
>>> which does not work from a different process.
>>>
>>> Even if it worked on Linux today, it would be a bad idea to rely on it
>>> from
>>> a portability point of view. No, the only sane way to make this work is
>>> that
>>> the process that initiates an I/O request is responsible for completing
>>> it.
>>> If another process needs to wait for an async I/O to complete, we must
>>> use
>>> some other means to do the waiting. Like the io_in_progress_lock that we
>>> already have, for the same purpose.
>>
>>
>> But calls to it are timed out after 10us, effectively turning the thing
>> into polling mode.
>
>
> We don't want polling... And even if we did, calling aio_suspend() in a way
> that's known to be broken, in a loop, is a pretty crappy way of polling.


Didn't fix that, but the attached patch does fix regression tests when
scanning over index types other than btree (was invoking elog when the
index am didn't have ampeeknexttuple)

Attachment

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Tom Lane
Date:
Claudio Freire <klaussfreire@gmail.com> writes:
> Didn't fix that, but the attached patch does fix regression tests when
> scanning over index types other than btree (was invoking elog when the
> index am didn't have ampeeknexttuple)

"ampeeknexttuple"?  That's a bit scary.  It would certainly be unsafe
for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
nbtree/README).
        regards, tom lane



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> Didn't fix that, but the attached patch does fix regression tests when
>> scanning over index types other than btree (was invoking elog when the
>> index am didn't have ampeeknexttuple)
>
> "ampeeknexttuple"?  That's a bit scary.  It would certainly be unsafe
> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
> nbtree/README).


It's not really the tuple, just the tid



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Claudio Freire <klaussfreire@gmail.com> writes:
>>> Didn't fix that, but the attached patch does fix regression tests when
>>> scanning over index types other than btree (was invoking elog when the
>>> index am didn't have ampeeknexttuple)
>>
>> "ampeeknexttuple"?  That's a bit scary.  It would certainly be unsafe
>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
>> nbtree/README).
>
>
> It's not really the tuple, just the tid

And, furthermore, it's used only to do prefetching, so even if the tid
was invalid when the tuple needs to be accessed, it wouldn't matter,
because the indexam wouldn't use the result of ampeeknexttuple to do
anything at that time.

Though, your comment does illustrate the need to document that on
ampeeknexttuple, for future users.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:


> Date: Thu, 29 May 2014 18:00:28 -0300
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: klaussfreire@gmail.com
> To: hlinnakangas@vmware.com
> CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>
> On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > On 05/29/2014 11:34 PM, Claudio Freire wrote:
> >>
> >> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
> >> <hlinnakangas@vmware.com> wrote:
> >>>
> >>> On 05/29/2014 04:12 PM, John Lumby wrote:
> >>>>
> >>>>
> >>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
> >>>>>
> >>>>> The patch seems to assume that you can put the aiocb struct in shared
> >>>>> memory, initiate an asynchronous I/O request from one process, and wait
> >>>>> for its completion from another process. I'm pretty surprised if that
> >>>>> works on any platform.
> >>>>
> >>>>
> >>>> It works on Linux. Actually, this ability allows the asyncio
> >>>> implementation to
> >>>> reduce complexity in one respect (yes, I know it looks complex enough):
> >>>> it makes waiting for completion of an in-progress IO simpler than in
> >>>> the existing synchronous IO case, since librt takes care of the
> >>>> waiting.
> >>>> Specifically, there is no need for extra wait-for-IO control blocks
> >>>> such as in bufmgr's WaitIO().
> >>>
> >>>
> >>> [checks]. No, it doesn't work. See attached test program.

Thanks for checking, and thanks for coming up with that test program.
However, yes, it really does work -- always (on Linux).
Your test program is doing things in the wrong order -
it calls aio_suspend *before* aio_error.
However, the rule is: call aio_suspend *after* aio_error,
and *only* if aio_error returns EINPROGRESS.

See the code changes to the fd.c function FileCompleteaio()
to see how we have done it. And I am attaching a corrected version
of your test program, which runs just fine.
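(FileCompleteaio() itself is only in the attached patch; the calling rule
described above, reduced to a minimal self-contained sketch, is:)

/* Sketch only: check aio_error() first, suspend only while EINPROGRESS,
 * and collect the result with aio_return() exactly once. */
#include <aio.h>
#include <errno.h>
#include <sys/types.h>

static ssize_t
wait_for_aio(struct aiocb *cb)
{
    const struct aiocb *list[1] = { cb };
    int     rc;

    while ((rc = aio_error(cb)) == EINPROGRESS)
    {
        if (aio_suspend(list, 1, NULL) != 0 && errno != EINTR)
            return -1;          /* unexpected failure of the wait itself */
    }
    if (rc != 0)
        return -1;              /* the I/O failed; rc is its errno */
    return aio_return(cb);      /* bytes transferred */
}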


> >>>
> >>> It kinda seems to work sometimes, because of the way it's implemented in
> >>> glibc. The aiocb struct has a field for the result value and errno, and
> >>> when
> >>> the I/O is finished, the worker thread fills them in. aio_error() and
> >>> aio_return() just return the values of those fields, so calling
> >>> aio_error()
> >>> or aio_return() do in fact happen to work from a different process.
> >>> aio_suspend(), however, is implemented by sleeping on a process-local
> >>> mutex,
> >>> which does not work from a different process.
> >>>
> >>> Even if it worked on Linux today, it would be a bad idea to rely on it
> >>> from
> >>> a portability point of view. No, the only sane way to make this work is
> >>> that
> >>> the process that initiates an I/O request is responsible for completing
> >>> it.
> >>> If another process needs to wait for an async I/O to complete, we must
> >>> use
> >>> some other means to do the waiting. Like the io_in_progress_lock that we
> >>> already have, for the same purpose.
> >>
> >>
> >> But calls to it are timed out after 10us, effectively turning the thing
> >> into polling mode.
> >
> >
> > We don't want polling... And even if we did, calling aio_suspend() in a way
> > that's known to be broken, in a loop, is a pretty crappy way of polling.

Well, as mentioned earlier, it is not broken. Whether it is efficient I am not sure.
I have looked at the mutex in aio_suspend that you mentioned, and I am not
quite convinced that, if the caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can verify that.
Where are you (Claudio) seeing 10us?

>
>
> Didn't fix that, but the attached patch does fix regression tests when
> scanning over index types other than btree (was invoking elog when the
> index am didn't have ampeeknexttuple)
Attachment

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Tom Lane
Date:
Claudio Freire <klaussfreire@gmail.com> writes:
> On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> "ampeeknexttuple"?  That's a bit scary.  It would certainly be unsafe
>>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
>>> nbtree/README).

>> It's not really the tuple, just the tid

> And, furthermore, it's used only to do prefetching, so even if the tid
> was invalid when the tuple needs to be accessed, it wouldn't matter,
> because the indexam wouldn't use the result of ampeeknexttuple to do
> anything at that time.

Nonetheless, getting the next tid out of the index may involve stepping
to the next index page, at which point you've lost your interlock
guaranteeing that the *previous* tid will still mean something by the time
you arrive at its heap page.  I presume that the ampeeknexttuple call is
issued before trying to visit the heap (otherwise you're not actually
getting much I/O overlap), so I think there's a real risk here.

Having said that, it's probably OK as long as this mode is only invoked
for user queries (with MVCC snapshots) and not for system indexscans.
        regards, tom lane



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:


> From: tgl@sss.pgh.pa.us
> To: klaussfreire@gmail.com
> CC: hlinnakangas@vmware.com; johnlumby@hotmail.com; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> Date: Thu, 29 May 2014 17:56:57 -0400
>
> Claudio Freire <klaussfreire@gmail.com> writes:
> > On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> >> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
> >>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
> >>> nbtree/README).
>
> >> It's not really the tuple, just the tid
>
> > And, furthermore, it's used only to do prefetching, so even if the tid
> > was invalid when the tuple needs to be accessed, it wouldn't matter,
> > because the indexam wouldn't use the result of ampeeknexttuple to do
> > anything at that time.
>
> Nonetheless, getting the next tid out of the index may involve stepping
> to the next index page, at which point you've lost your interlock

I think we are OK, as peeknexttuple (yes, bad name, sorry, can change it ...)
never advances to another page:

 *    btpeeknexttuple() -- peek at the next tuple different from any blocknum in pfch_list
 *                           without reading a new index page
 *                       and without causing any side-effects such as altering values in control blocks
 *               if found,     store blocknum in next element of pfch_list


> guaranteeing that the *previous* tid will still mean something by the time
> you arrive at its heap page. I presume that the ampeeknexttuple call is
> issued before trying to visit the heap (otherwise you're not actually
> getting much I/O overlap), so I think there's a real risk here.
>
> Having said that, it's probably OK as long as this mode is only invoked
> for user queries (with MVCC snapshots) and not for system indexscans.
>
> regards, tom lane

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, May 29, 2014 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> "ampeeknexttuple"?  That's a bit scary.  It would certainly be unsafe
>>>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
>>>> nbtree/README).
>
>>> It's not really the tuple, just the tid
>
>> And, furthermore, it's used only to do prefetching, so even if the tid
>> was invalid when the tuple needs to be accessed, it wouldn't matter,
>> because the indexam wouldn't use the result of ampeeknexttuple to do
>> anything at that time.
>
> Nonetheless, getting the next tid out of the index may involve stepping
> to the next index page, at which point you've lost your interlock
> guaranteeing that the *previous* tid will still mean something by the time

No, no... that's exactly why a new regproc is needed, because for
prefetching, we need to get the next tid that satisfies some
conditions *without* walking the index.

This, in nbtree, only looks through the tid array to find a suitable
tid, or just returns false if the array is exhausted.
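A rough sketch of that behaviour, going by the btpeeknexttuple() comment
quoted earlier in the thread (everything here is invented except the idea of
scanning the current page's tid array against a prefetch list of block
numbers):

/* Sketch only: peek within the already-read index page's tid array for
 * a heap block not yet prefetched; never step to another index page. */
#include "postgres.h"
#include "storage/itemptr.h"

#define PFCH_LIST_SIZE 32

typedef struct PrefetchList
{
    BlockNumber blocks[PFCH_LIST_SIZE]; /* heap blocks already requested */
    int         nblocks;
} PrefetchList;

static bool
peek_next_heap_block(ItemPointerData *tids, int ntids, int current,
                     PrefetchList *pfch)
{
    int     i,
            j;

    for (i = current + 1; i < ntids; i++)   /* stay on this page's array */
    {
        BlockNumber blk = ItemPointerGetBlockNumber(&tids[i]);
        bool        seen = false;

        for (j = 0; j < pfch->nblocks; j++)
            if (pfch->blocks[j] == blk)
                seen = true;
        if (!seen && pfch->nblocks < PFCH_LIST_SIZE)
        {
            pfch->blocks[pfch->nblocks++] = blk;    /* remember it */
            return true;                    /* caller prefetches blk */
        }
    }
    return false;       /* exhausted: no side effects, no page walk */
}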

> Having said that, it's probably OK as long as this mode is only invoked
> for user queries (with MVCC snapshots) and not for system indexscans.

I think system index scans will also invoke this. There's no rule
excluding that possibility.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, May 29, 2014 at 7:11 PM, John Lumby <johnlumby@hotmail.com> wrote:
>> Nonetheless, getting the next tid out of the index may involve stepping
>> to the next index page, at which point you've lost your interlock
>
> I think we are OK, as peeknexttuple (yes, bad name, sorry, can change it
> ...)
> never advances to another page:
>
>  *    btpeeknexttuple() -- peek at the next tuple different from any
> blocknum in pfch_list
>  *                           without reading a new index page
>  *                       and without causing any side-effects such as
> altering values in control blocks
>  *               if found,     store blocknum in next element of pfch_list


Yeah, I was getting to that conclusion myself too.

We could call it amprefetchnextheap, since it does just prefetch, and
is good for nothing *but* prefetch.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:


> Date: Thu, 29 May 2014 18:00:28 -0300
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: klaussfreire@gmail.com
> To: hlinnakangas@vmware.com
> CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>

> >>>
> >>> Even if it worked on Linux today, it would be a bad idea to rely on it
> >>> from
> >>> a portability point of view. No, the only sane way to make this work is
> >>> that
> >>> the process that initiates an I/O request is responsible for completing
> >>> it.


I meant to add - it is really a significant benefit that a backend
can wait on the aio of a different backend's original prefetching aio_read.
Remember that we check completion only when the backend decides it really
wants the block in a buffer, i.e. ReadBuffer and friends,
which might be a very long time after it had issued the prefetch request,
or even never (see below). We don't want other backends which want that
block to have to wait for the originator to get around to reading it.
*Especially* since the originator may *never* read it if it quits its scan early,
leaving prefetched but unread blocks behind. (Which is also taken
care of in the patch.)


> >>> If another process needs to wait for an async I/O to complete, we must
> >>> use
> >>> some other means to do the waiting. Like the io_in_progress_lock that we
> >>> already have, for the same purpose.

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Andres Freund
Date:
Hi,

On 2014-05-29 17:53:51 -0400, John Lumby wrote:
> to see how we have done it. And I am attaching a corrected version
> of your test program, which runs just fine.

It's perfectly fine to not be up to the coding style at this point, but
trying to adhere to it to some degree will make code review later less
painful...
* comments with **
* line length
* tabs vs spaces
* ...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, May 29, 2014 at 6:53 PM, John Lumby <johnlumby@hotmail.com> wrote:
> Well, as mentioned earlier, it is not broken. Whether it is efficient
> I am not sure.
> I have looked at the mutex in aio_suspend that you mentioned, and I am not
> quite convinced that, if the caller is not the original aio_read process,
> it renders the suspend() into an instant timeout. I will see if I can
> verify that.
> Where are you (Claudio) seeing 10us?


fd.c, in FileCompleteaio, sets timeout to:

my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;

Which is 10k ns, which is 10 us.

It loops 256 times at most, so it's polling 256 times with a 10 us
timeout. Sounds wasteful.

I'd:

1) If it's the same process, wait for the full timeout (no looping).
If you have to loop (EAGAIN or EINTR - which I just noticed you don't
check for), that's ok.

2) If it's not, just fall through, don't wait, issue the I/O. The
kernel will merge the requests.
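In sketch form (names invented; the real FileCompleteaio is in the patch),
that would look something like: the originator blocks without a timeout,
while a foreign process falls through and simply issues its own read,
letting the kernel merge the duplicate requests.

/* Sketch only: suggested completion logic, not the patch's code. */
#include <aio.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

static int
complete_aio(struct aiocb *cb, pid_t originator_pid)
{
    const struct aiocb *list[1] = { cb };
    int     rc;

    if (getpid() != originator_pid)
        return EINPROGRESS;     /* foreign process: don't wait here; the
                                 * caller falls back to a normal read and
                                 * the kernel merges the duplicate I/O */

    while ((rc = aio_error(cb)) == EINPROGRESS)
    {
        /* our own request: block indefinitely; loop only on EINTR */
        if (aio_suspend(list, 1, NULL) != 0 && errno != EINTR)
            return errno;
    }
    return rc;                  /* 0, or the failed I/O's errno */
}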



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, May 29, 2014 at 7:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> 1) If it's the same process, wait for the full timeout (no looping).
> If you have to loop (EAGAIN or EINTR - which I just noticed you don't
> check for), that's ok.

Sorry, meant to say just looping on EINTR.

About the style guidelines, no, I just copy the style of surrounding
code usually.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 05/30/2014 12:53 AM, John Lumby wrote:
>> Date: Thu, 29 May 2014 18:00:28 -0300
>> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>> From: klaussfreire@gmail.com
>> To: hlinnakangas@vmware.com
>> CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>>
>> On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> On 05/29/2014 11:34 PM, Claudio Freire wrote:
>>>>
>>>> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
>>>> <hlinnakangas@vmware.com> wrote:
>>>>>
>>>>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>>>>
>>>>>>
>>>>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>>>>
>>>>>>> The patch seems to assume that you can put the aiocb struct in shared
>>>>>>> memory, initiate an asynchronous I/O request from one process, and wait
>>>>>>> for its completion from another process. I'm pretty surprised if that
>>>>>>> works on any platform.
>>>>>>
>>>>>>
>>>>>> It works on Linux. Actually, this ability allows the asyncio
>>>>>> implementation to
>>>>>> reduce complexity in one respect (yes, I know it looks complex enough):
>>>>>> it makes waiting for completion of an in-progress IO simpler than in
>>>>>> the existing synchronous IO case, since librt takes care of the
>>>>>> waiting.
>>>>>> Specifically, there is no need for extra wait-for-IO control blocks
>>>>>> such as in bufmgr's WaitIO().
>>>>>
>>>>>
>>>>> [checks]. No, it doesn't work. See attached test program.
>
> Thanks for checking, and thanks for coming up with that test program.
> However, yes, it really does work -- always (on Linux).
> Your test program is doing things in the wrong order -
> it calls aio_suspend *before* aio_error.
> However, the rule is: call aio_suspend *after* aio_error,
> and *only* if aio_error returns EINPROGRESS.

I see no such rule in the man pages of any of the functions involved.
And it wouldn't matter anyway; the behavior is exactly the same if you
aio_error() first.

> See the code changes to the fd.c function FileCompleteaio()
> to see how we have done it. And I am attaching a corrected version
> of your test program, which runs just fine.

As Claudio mentioned earlier, the way FileCompleteaio() uses aio_suspend 
is just a complicated way of polling. You might as well replace the 
aio_suspend() calls with pg_usleep().

>>>>> It kinda seems to work sometimes, because of the way it's implemented in
>>>>> glibc. The aiocb struct has a field for the result value and errno, and
>>>>> when
>>>>> the I/O is finished, the worker thread fills them in. aio_error() and
>>>>> aio_return() just return the values of those fields, so calling
>>>>> aio_error()
>>>>> or aio_return() do in fact happen to work from a different process.
>>>>> aio_suspend(), however, is implemented by sleeping on a process-local
>>>>> mutex,
>>>>> which does not work from a different process.
>>>>>
>>>>> Even if it worked on Linux today, it would be a bad idea to rely on it
>>>>> from
>>>>> a portability point of view. No, the only sane way to make this work is
>>>>> that
>>>>> the process that initiates an I/O request is responsible for completing
>>>>> it.
>>>>> If another process needs to wait for an async I/O to complete, we must
>>>>> use
>>>>> some other means to do the waiting. Like the io_in_progress_lock that we
>>>>> already have, for the same purpose.
>>>>
>>>> But calls to it are timed out after 10us, effectively turning the thing
>>>> into polling mode.
>>>
>>> We don't want polling... And even if we did, calling aio_suspend() in a way
>>> that's known to be broken, in a loop, is a pretty crappy way of polling.
>
> Well, as mentioned earlier, it is not broken. Whether it is efficient I am not sure.
> I have looked at the mutex in aio_suspend that you mentioned, and I am not
> quite convinced that, if the caller is not the original aio_read process,
> it renders the suspend() into an instant timeout. I will see if I can verify that.

I don't see the point of pursuing this design further. Surely we don't 
want to use polling here, and you're relying on undefined behavior 
anyway. I'm pretty sure aio_return/aio_error won't work from a different 
process on all platforms, even if it happens to work on Linux. Even on 
Linux, it might stop working if the underlying implementation changes 
from the glibc pthread emulation to something kernel-based.

- Heikki



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Fri, May 30, 2014 at 4:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>>>> We don't want polling... And even if we did, calling aio_suspend() in a
>>>> way
>>>> that's known to be broken, in a loop, is a pretty crappy way of polling.
>>
>>
>> Well, as mentioned earlier, it is not broken. Whether it is
>> efficient I am not sure.
>> I have looked at the mutex in aio_suspend that you mentioned, and I am not
>> quite convinced that, if the caller is not the original aio_read process,
>> it renders the suspend() into an instant timeout. I will see if I can
>> verify that.
>
>
> I don't see the point of pursuing this design further. Surely we don't want
> to use polling here, and you're relying on undefined behavior anyway. I'm
> pretty sure aio_return/aio_error won't work from a different process on all
> platforms, even if it happens to work on Linux. Even on Linux, it might stop
> working if the underlying implementation changes from the glibc pthread
> emulation to something kernel-based.

I'll try to do some measuring of performance with:
a) git head
b) git head + patch as-is
c) git head + patch without aio_suspend in foreign processes (just re-read)
d) git head + patch with a lwlock (or whatever works) instead of aio_suspend

a-c will be the fastest, d might take some while.

I'll let you know of the results as I get them.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
johnlumby
Date:
On 05/30/14 09:36, Claudio Freire wrote:
> On Fri, May 30, 2014 at 4:15 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> I see no such rule in the man pages of any of the functions involved.
>> And it wouldn't matter anyway; the behavior is exactly the same if you
>> aio_error() first.

You are completely correct, and I was wrong.
For the different-process case,
it is not the order of the aio calls that makes it work/not work;
it is whether it polls with a timeout.
I think I was confusing this with a rule concerning the order for
calling aio_return.


>> We don't want polling... And even if we did, calling aio_suspend() in a way
>> that's known to be broken, in a loop, is a pretty crappy way of polling.

I have made a change to the complete_aio functions
to use polling *only* if the caller is not the originator of the aio_read.
If it is the originator, it now calls aio_suspend with no timeout
(i.e. wait till complete).
The loop also now checks for the EINTR case which someone
pointed out.
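In sketch form, the wait policy just described looks something like this
(hypothetical names; the actual change is a few lines in buf_async.c and
fd.c in the attached patch):

/* Sketch only: originator blocks with no timeout; anyone else polls,
 * since aio_suspend's wakeup does not carry across processes. */
#include <aio.h>
#include <errno.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

static int
complete_aio_wait(struct aiocb *cb, pid_t originator_pid)
{
    const struct aiocb *list[1] = { cb };
    int     rc;

    while ((rc = aio_error(cb)) == EINPROGRESS)
    {
        if (getpid() == originator_pid)
        {
            /* our own request: sleep until it completes */
            if (aio_suspend(list, 1, NULL) != 0 && errno != EINTR)
                return errno;
        }
        else
        {
            /* foreign request: degenerates to a 10us timed sleep;
             * the patch bounds the number of retries */
            struct timespec t = { 0, 10000 };

            (void) aio_suspend(list, 1, &t);    /* EAGAIN/EINTR: retry */
        }
    }
    return rc;                  /* 0, or the failed I/O's errno */
}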

In my test runs, with a debug message in FileCompleteaio to tell me
whether the caller is or is not the originator of the aio_read,
I see > 99.8% of calls are from the originator and only < 0.2% are not.
e.g.  (samples from two backends)

different : 10
same : 11726

different : 38
same : 12105

A new patch based on today's (140531) source is attached.

This improves one of my benchmarks by about 10% in throughput,
and now shows an overall 23% improvement relative to the existing code with posix_fadvise.

So hopefully this addresses your performance concern.

If you look at the new patch, you'll see that for the different-pid case,
I still call aio_suspend with a timeout.
As you or Claudio pointed out earlier, it could just as well sleep for the same timeout,
but the small advantage of calling aio_suspend is that it returns immediately
if the IO completed just between aio_error returning EINPROGRESS and the aio_suspend call.
Also it makes the code simpler. In fact this change is quite small, just a few lines
in backend/storage/buffer/buf_async.c and backend/storage/file/fd.c.

Based on this, I think it is not necessary to get rid of the polling altogether
(and in any case, as far as I can see, it would be very difficult).


>>> Well, as mentioned earlier, it is not broken. Whether it is
>>> efficient I am not sure.
>>> I have looked at the mutex in aio_suspend that you mentioned, and I am not
>>> quite convinced that, if the caller is not the original aio_read process,
>>> it renders the suspend() into an instant timeout. I will see if I can
>>> verify that.

>> I don't see the point of pursuing this design further. Surely we don't want
>> to use polling here, and you're relying on undefined behavior anyway. I'm
>> pretty sure aio_return/aio_error won't work from a different process on all
>> platforms, even if it happens to work on Linux. Even on Linux, it might stop
>> working if the underlying implementation changes from the glibc pthread
>> emulation to something kernel-based.

Good point. I have included the guts of your little test program
(modified to do polling) into the existing autoconf test program
that decides on the
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
See config/c-library.m4.
I hope this goes some way to answer your concern about robustness,
as at least now, if the implementation changes in some way that
renders the polling ineffective, it will be caught in configure.


> I'll try to do some measuring of performance with:
> a) git head
> b) git head + patch as-is
> c) git head + patch without aio_suspend in foreign processes (just re-read)
> d) git head + patch with a lwlock (or whatever works) instead of aio_suspend
>
> a-c will be the fastest, d might take some while.
>
> I'll let you know of the results as I get them.


Claudio, I am not quite sure if what I am submitting now is
quite the same as any of yours. As I promised before, but have
not yet done, I will package one or two of my benchmarks and
send them in.



Attachment

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
johnlumby
Date:
On 05/31/14 20:44, johnlumby wrote:
> On 05/30/14 09:36, Claudio Freire wrote:
>
> Good point. I have included the guts of your little test program
> (modified to do polling) into the existing autoconf test program
> that decides on the
> #define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
> See config/c-library.m4.
> I hope this goes some way to answer your concern about robustness,
> as at least now, if the implementation changes in some way that
> renders the polling ineffective, it will be caught in configure.

I meant to add that by including this test, which involves a fork(),
in the autoconf tester, on Windows
USE_AIO_ATOMIC_BUILTIN_COMP_SWAP would always be un-defined.
(But could then be defined manually if someone wanted to give it a try.)

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Sat, May 31, 2014 at 9:44 PM, johnlumby <johnlumby@hotmail.com> wrote:
> I'll try to do some measuring of performance with:
> a) git head
> b) git head + patch as-is
> c) git head + patch without aio_suspend in foreign processes (just re-read)
> d) git head + patch with a lwlock (or whatever works) instead of aio_suspend
>
> a-c will be the fastest, d might take some while.
>
> I'll let you know of the results as I get them.
>
>
> Claudio, I am not quite sure if what I am submitting now is
> quite the same as any of yours. As I promised before, but have
> not yet done, I will package one or two of my benchmarks and
> send them in.


It's a tad different. c will not do polling on the foreign process, I
will just let PG do the read again. d will be like polling, but
without the associated CPU overhead.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 06/01/2014 03:44 AM, johnlumby wrote:
> If you look at the new patch, you'll see that for the different-pid case,
> I still call aio_suspend with a timeout.
> As you or Claudio pointed out earlier, it could just as well sleep
> for the same timeout,
> but the small advantage of calling aio_suspend is that it returns
> immediately if the IO completed just between
> aio_error returning EINPROGRESS and the aio_suspend call.
> Also it makes the code simpler. In fact this change is quite small,
> just a few lines
> in backend/storage/buffer/buf_async.c and backend/storage/file/fd.c.
>
> Based on this, I think it is not necessary to get rid of the polling
> altogether
> (and in any case, as far as I can see, it would be very difficult).

That's still just as wrong as it always has been. Just get rid of it. 
Don't put aiocb structs in shared memory at all. They don't belong there.

>>>> Well, as mentioned earlier, it is not broken. Whether it is
>>>> efficient I am not sure.
>>>> I have looked at the mutex in aio_suspend that you mentioned, and I am not
>>>> quite convinced that, if the caller is not the original aio_read process,
>>>> it renders the suspend() into an instant timeout. I will see if I can
>>>> verify that.
>>>
>>> I don't see the point of pursuing this design further. Surely we don't want
>>> to use polling here, and you're relying on undefined behavior anyway. I'm
>>> pretty sure aio_return/aio_error won't work from a different process on all
>>> platforms, even if it happens to work on Linux. Even on Linux, it might stop
>>> working if the underlying implementation changes from the glibc pthread
>>> emulation to something kernel-based.
>
> Good point. I have included the guts of your little test program
> (modified to do polling) into the existing autoconf test program
> that decides on the
> #define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
> See config/c-library.m4.
> I hope this goes some way to answer your concern about robustness,
> as at least now, if the implementation changes in some way that
> renders the polling ineffective, it will be caught in configure.

No, that does not make it robust enough.

- Heikki



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
johnlumby
Date:
updated version of patch compatible with git head of 140608,
(adjusted proc oid and a couple of minor fixes)

Attachment

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
I'm having trouble setting max_async_io_prefetchers in postgresql.conf

It says it cannot be changed.

Is that fixed at initdb time? (sounds like a bad idea if it is)

On Sun, Jun 8, 2014 at 11:12 PM, johnlumby <johnlumby@hotmail.com> wrote:
> updated version of patch compatible with git head of 140608,
> (adjusted proc oid and a couple of minor fixes)



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Greg Stark
Date:
On Wed, May 28, 2014 at 2:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> How portable is POSIX aio nowadays? Googling around, it still seems that on
> Linux, it's implemented using threads. Does the thread-emulation
> implementation cause problems with the rest of the backend, which assumes
> that there is only a single thread? In any case, I think we'll want to
> encapsulate the AIO implementation behind some kind of an API, to allow
> other implementations to co-exist.

I think POSIX aio is pretty damn standard and it's a pretty fiddly
interface. If we abstract it behind an i/o interface we're going to
lose a lot of the power. Abstracting it behind a set of buffer manager
operations (initiate i/o on buffer, complete i/o on buffer, abort i/o
on buffer) should be fine but that's basically what we have, no?

I don't think the threaded implementation on Linux is the one to use
though. I find this *super* confusing but the kernel definitely
supports aio syscalls, glibc also has a threaded implementation it
uses if run on a kernel that doesn't implement the syscalls, and I
think there are existing libaio and librt libraries from outside glibc
that do one or the other. Which you build against seems to make a big
difference. My instinct is that anything but the kernel native
implementation will be worthless. The overhead of thread communication
will completely outweigh any advantage over posix_fadvise's partial
win.

The main advantage of posix aio is that we can actually receive the
data out of order. With posix_fadvise we can get the i/o and cpu
overlap but we will never process the later blocks until the earlier
requests are satisfied and processed in order. With aio you could do a
sequential scan, initiating i/o on 1,000 blocks and then processing
them as they arrive, initiating new requests as those blocks are
handled.
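In outline, that out-of-order pattern could look like the sketch below
(window size, names, and the process_block() consumer are all invented):
keep a window of outstanding aio_reads, aio_suspend on the whole list, and
consume whichever block finished first, refilling as you go.

/* Sketch only: process blocks in completion order, not file order. */
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define WINDOW 16
#define BLOCKSIZE 8192

extern void process_block(char *buf);       /* assumed consumer */

void
scan_out_of_order(int fd, long nblocks)
{
    static char bufs[WINDOW][BLOCKSIZE];
    static struct aiocb cbs[WINDOW];
    const struct aiocb *list[WINDOW];
    long    next = 0, done = 0;
    int     i;

    memset(cbs, 0, sizeof(cbs));
    for (i = 0; i < WINDOW; i++)
        list[i] = NULL;
    for (i = 0; i < WINDOW && next < nblocks; i++, next++)
    {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = bufs[i];
        cbs[i].aio_nbytes = BLOCKSIZE;
        cbs[i].aio_offset = (off_t) next * BLOCKSIZE;
        (void) aio_read(&cbs[i]);           /* error handling elided */
        list[i] = &cbs[i];
    }
    while (done < nblocks)
    {
        aio_suspend(list, WINDOW, NULL);    /* wake on any completion */
        for (i = 0; i < WINDOW; i++)
        {
            if (list[i] == NULL || aio_error(&cbs[i]) == EINPROGRESS)
                continue;
            (void) aio_return(&cbs[i]);
            process_block(bufs[i]);         /* consume in arrival order */
            done++;
            if (next < nblocks)             /* refill the window */
            {
                cbs[i].aio_offset = (off_t) next++ * BLOCKSIZE;
                (void) aio_read(&cbs[i]);   /* error handling elided */
            }
            else
                list[i] = NULL;
        }
    }
}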

When I investigated this I found the buffer manager's I/O bits seemed
to already be able to represent the state we needed (i/o initiated on
this buffer but not completed). The problem was in ensuring that a
backend would process the i/o completion promptly when it might be in
the midst of handling other tasks and might even get an elog() stack
unwinding. The interface that actually fits Postgres best might be the
threaded interface (orthogonal to the threaded implementation
question) which is you give aio a callback which gets called on a
separate thread when the i/o completes. The alternative is you give
aio a list of operation control blocks and it tells you the state of
all the i/o operations. But it's not clear to me how you arrange to do
that regularly, promptly, and reliably.
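The callback flavour referred to is plain POSIX: an aio request can carry a
sigevent asking for SIGEV_THREAD notification, in which case librt runs a
handler on a separate thread when the I/O completes. A minimal sketch (file
name and handler invented):

/* Sketch only: completion delivered via a SIGEV_THREAD callback.
 * Build with: cc test.c -lrt -lpthread */
#include <aio.h>
#include <fcntl.h>
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sem_t done;

static void
on_complete(union sigval sv)
{
    /* runs on a thread librt spawns, not the initiating thread */
    struct aiocb *cb = sv.sival_ptr;

    printf("read %zd bytes\n", aio_return(cb));
    sem_post(&done);
}

int
main(void)
{
    static char buf[512];
    struct aiocb cb;
    int     fd = open("/etc/hosts", O_RDONLY);

    sem_init(&done, 0, 0);
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    cb.aio_sigevent.sigev_notify_function = on_complete;
    cb.aio_sigevent.sigev_value.sival_ptr = &cb;

    if (aio_read(&cb) != 0)
        return 1;
    sem_wait(&done);            /* block until the callback fires */
    return 0;
}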

The other gotcha here is that the kernel implementation only does
anything useful on DIRECT_IO files. That means you have to do *all*
the prefetching and i/o scheduling yourself. You would be doing that
anyways for sequential scans and bitmap scans -- and we already do it
with things like synchronised scans and posix_fadvise -- but index
scans would need to get some intelligence for when it makes sense to
read more than one page at a time.  It might be possible to do
something fairly coarse like having our i/o operators keep track of
how often i/o on a relation falls within a certain number of blocks of
an earlier i/o and autotune number of blocks to read based on that. It
might not be hard to do better than the kernel with even basic info
like what level of the index we're reading or what type of pointer
we're following.

Finally, when I did the posix_fadvise work I wrote a synthetic
benchmark for testing the equivalent i/o pattern of a bitmap scan. It
let me simulate bitmap scans of varying densities with varying
parameters, notably how many i/o to keep in flight at once. It
supported posix_fadvise or aio. You should look it up in the archives,
it made for some nice looking graphs. IIRC I could not find any build
environment where aio offered any performance boost at all. I think
this means I just didn't know how to build it against the right
libraries or wasn't using the right kernel or there was some skew
between them at the time.

-- 
greg



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark@mit.edu> wrote:
> I don't think the threaded implementation on Linux is the one to use
> though. I find this *super* confusing but the kernel definitely
> supports aio syscalls, glibc also has a threaded implementation it
> uses if run on a kernel that doesn't implement the syscalls, and I
> think there are existing libaio and librt libraries from outside glibc
> that do one or the other. Which you build against seems to make a big
> difference. My instinct is that anything but the kernel native
> implementation will be worthless. The overhead of thread communication
> will completely outweigh any advantage over posix_fadvise's partial
> win.

What overhead?

The only communication is through a "done" bit and associated
synchronization structure when *and only when* you want to wait on it.

Furthermore, posix_fadvise is braindead on this use case, been there,
done that. What you win with threads is a better postgres-kernel
interaction; even if you lose some CPU performance, it's gonna beat
posix_fadvise by a large margin.


> The main advantage of posix aio is that we can actually receive the
> data out of order. With posix_fadvise we can get the i/o and cpu
> overlap but we will never process the later blocks until the earlier
> requests are satisfied and processed in order. With aio you could do a
> sequential scan, initiating i/o on 1,000 blocks and then processing
> them as they arrive, initiating new requests as those blocks are
> handled.

And each and every I/O you issue with it counts as such on the kernel side.

It's not the case with posix_fadvise, mind you, and that's terribly
damaging for performance.

> When I investigated this I found the buffer manager's I/O bits seemed
> to already be able to represent the state we needed (i/o initiated on
> this buffer but not completed). The problem was in ensuring that a
> backend would process the i/o completion promptly when it might be in
> the midst of handling other tasks and might even get an elog() stack
> unwinding. The interface that actually fits Postgres best might be the
> threaded interface (orthogonal to the threaded implementation
> question) which is you give aio a callback which gets called on a
> separate thread when the i/o completes. The alternative is you give
> aio a list of operation control blocks and it tells you the state of
> all the i/o operations. But it's not clear to me how you arrange to do
> that regularly, promptly, and reliably.

Indeed we've been musing about using librt's support of completion
callbacks for that.

> The other gotcha here is that the kernel implementation only does
> anything useful on DIRECT_IO files. That means you have to do *all*
> the prefetching and i/o scheduling yourself.

If that's the case, we should discard kernel-based implementations and
stick to thread-based ones. Postgres cannot do scheduling as
efficiently as the kernel, and it shouldn't try.

> You would be doing that
> anyways for sequential scans and bitmap scans -- and we already do it
> with things like synchronised scans and posix_fadvise

That only works because the patterns are semi-sequential. If you try
to schedule random access, it becomes effectively impossible without
hardware info.

The kernel is the one with hardware info.

> Finally, when I did the posix_fadvise work I wrote a synthetic
> benchmark for testing the equivalent i/o pattern of a bitmap scan. It
> let me simulate bitmap scans of varying densities with varying
> parameters, notably how many i/o to keep in flight at once. It
> supported posix_fadvise or aio. You should look it up in the archives,
> it made for some nice looking graphs. IIRC I could not find any build
> environment where aio offered any performance boost at all. I think
> this means I just didn't know how to build it against the right
> libraries or wasn't using the right kernel or there was some skew
> between them at the time.

If it's old, it's probable that you didn't hit the kernel's braindeadness,
since it was introduced somewhat recently (i.e., before 3.x, I
believe).

Even if you did hit it, bitmap heap scans are blessed with sequential
ordering. The technique doesn't work nearly as well with random I/O
that might be sorted from time to time.

When traversing an index, you do a mostly sequential pattern due to
physical correlation, but not a completely sequential one. Not only that:
with the assumption of random I/O, and the uncertainty of when the
scan will be aborted too, you don't read ahead as much as you could if
you knew it was sequential or a full scan. That kills performance. You
don't fetch enough ahead of time to avoid stalls, and the kernel
doesn't do read-ahead either, because posix_fadvise effectively
disables it, resulting in the equivalent of direct I/O with bad
scheduling.

Solving this for index scans isn't just a little more complex. It's
insanely more complex, because you need hardware information to do it
right. How many spindles, how many sectors per cylinder if it's
rotational, how big the segments if it's flash, etc, etc... all stuff
hidden away inside the kernel.

It's not a good idea to try to do the kernel's job. Aio, even
threaded, lets you avoid that.

If you still have the benchmark around, I suggest you shuffle the
sectors a little bit (but not fully) and try them with semi-random
I/O.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Fujii Masao
Date:
On Mon, Jun 9, 2014 at 11:12 AM, johnlumby <johnlumby@hotmail.com> wrote:
> updated version of patch compatible with git head of 140608,
> (adjusted proc oid and a couple of minor fixes)

Compilation of patched version on MacOS failed. The error messages were

gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -I../../../../src/include   -c -o heapam.o heapam.c
heapam.c: In function 'heap_unread_add':
heapam.c:362: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:363: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:369: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:375: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:381: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:387: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:405: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c: In function 'heap_unread_subtract':
heapam.c:419: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:425: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:434: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:442: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:452: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:453: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:454: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:456: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c: In function 'heapgettup':
heapam.c:944: error: 'struct HeapScanDescData' has no member named
'rs_pfchblock'
heapam.c:716: warning: unused variable 'ix'
heapam.c: In function 'heapgettup_pagemode':
heapam.c:1243: error: 'struct HeapScanDescData' has no member named
'rs_pfchblock'
heapam.c:1029: warning: unused variable 'ix'
heapam.c: In function 'heap_endscan':
heapam.c:1808: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:1809: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
make[4]: *** [heapam.o] Error 1
make[3]: *** [heap-recursive] Error 2
make[2]: *** [access-recursive] Error 2
make[1]: *** [install-backend-recurse] Error 2
make: *** [install-src-recurse] Error 2


A huge patch is basically not easy to review. What about simplifying the patch
by excluding non-core parts like the change of pg_stat_statements, so that
reviewers can easily read the patch? We can add such non-core parts later.

Regards,

-- 
Fujii Masao



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:
Thanks Fujii, that is a bug -- an #ifdef USE_PREFETCH is missing in heapam.c
(maybe several).
I will fix it in the next patch version.

I also appreciate it is not easy to review the patch.
There are really 4 (or maybe 5) parts :
  
     .   async io (librt functions)
     .   buffer management   (allocating, locking and pinning etc)
     .   scanners making prefetch calls
     .   statistics

    and the autoconf input program

I will see what I can do.   Maybe putting an indicator against each modified file?

I am currently working on two things :
  .   alternative way for non-originator of an aio_read to wait on completion
             (LWlock instead of polling the aiocb)
      This was talked about in several earlier posts and Claudio is also working on something there
  .   package up my benchmark

Cheers    John

----------------------------------------
> Date: Fri, 20 Jun 2014 04:21:19 +0900
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: masao.fujii@gmail.com
> To: johnlumby@hotmail.com
> CC: pgsql-hackers@postgresql.org; klaussfreire@gmail.com
>
> On Mon, Jun 9, 2014 at 11:12 AM, johnlumby <johnlumby@hotmail.com> wrote:
>> updated version of patch compatible with git head of 140608,
>> (adjusted proc oid and a couple of minor fixes)
>
> Compilation of patched version on MacOS failed. The error messages were
>
> [... compiler output trimmed; quoted in full above ...]
>
>
> Huge patch is basically not easy to review. What about simplifying the patch
> by excluding non-core parts like the change of pg_stat_statements, so that
> reviewers can easily read the patch? We can add such non-core parts later.
>
> Regards,
>
> --
> Fujii Masao


Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Jim Nasby
Date:
On 6/20/14, 5:12 PM, John Lumby wrote:
> I also appreciate it is not easy to review the patch.
> There are really 4 (or maybe 5) parts :
>
>       .   async io (librt functions)
>       .   buffer management   (allocating, locking and pinning etc)
>       .   scanners making prefetch calls
>       .   statistics
>
>      and the autoconf input program
>
> I will see what I can do.   Maybe putting an indicator against each modified file?

Generally the best way to handle cases like this is to create multiple patches, each of which builds upon previous
ones.
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:

----------------------------------------
> Date: Thu, 19 Jun 2014 15:43:44 -0300
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
> From: klaussfreire@gmail.com
> To: stark@mit.edu
> CC: hlinnakangas@vmware.com; johnlumby@hotmail.com; pgsql-hackers@postgresql.org
>
> On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark@mit.edu> wrote:
>> I don't think the threaded implementation on Linux is the one to use
>> though.  [... ] The overhead of thread communication
>> will completely outweigh any advantage over posix_fadvise's partial
>> win.
>
> What overhead?
>
> The only communication is through a "done" bit and associated
> synchronization structure when *and only when* you want to wait on it.
>

Threads do cost some extra CPU, but provided the system has some
spare CPU capacity, performance improves because of reduced IO wait.
I quoted a measured 17% - 18% improvement in the README,
along with some more explanation of when the async IO gives an improvement.

> Furthermore, posix_fadvise is braindead on this use case, been there,
> done that. What you win with threads is a better postgres-kernel
> interaction, even if you lose some CPU performance it's gonna beat
> posix_fadvise by a large margin.
>
> [...]
>
>> When I investigated this I found the buffer manager's I/O bits seemed
>> to already be able to represent the state we needed (i/o initiated on
>> this buffer but not completed). The problem was in ensuring that a
>> backend would process the i/o completion promptly when it might be in
>> the midst of handling other tasks and might even get an elog() stack
>> unwinding. The interface that actually fits Postgres best might be the
>> threaded interface (orthogonal to the threaded implementation
>> question) which is you give aio a callback which gets called on a
>> separate thread when the i/o completes. The alternative is you give
>> aio a list of operation control blocks and it tells you the state of
>> all the i/o operations. But it's not clear to me how you arrange to do
>> that regularly, promptly, and reliably.
>
> Indeed we've been musing about using librt's support of completion
> callbacks for that.

For the most common case, where a backend issues a PrefetchBuffer
and then that *same* backend issues ReadBuffer, the posix aio works
ideally: there is no need for any callback or completion signal;
we simply check "is it complete" during the ReadBuffer.
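
For illustration, the shape of that "is it complete" check in plain
POSIX aio calls; wait_for_prefetch is a hypothetical name, not the
patch's API:

    #include <aio.h>
    #include <errno.h>

    /* Returns 0 if the prefetched read completed successfully,
     * first waiting for it if it is still in flight. */
    static int
    wait_for_prefetch(struct aiocb *acb)
    {
        const struct aiocb *list[1];
        int rc = aio_error(acb);

        if (rc == EINPROGRESS)
        {
            list[0] = acb;
            aio_suspend(list, 1, NULL); /* block until this one completes */
            rc = aio_error(acb);
        }
        if (rc == 0)
            (void) aio_return(acb);     /* reap the result; exactly once */
        return rc;
    }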

It is when some *other* backend gets there first with the ReadBuffer that
things are a bit trickier.    The current version of the patch did polling for that case
but that drew criticism,    and so an imminent new version of the patch
uses the sigevent mechanism.    And there are other ways still.

In an earlier posting I reported that, in my benchmark,
99.8% of [FileCompleteaio] calls are from the originator and only < 0.2% are not.
So, from a performance perspective, only the common case really matters.


Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Greg Stark
Date:
On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote:
> It is when some *other* backend gets there first with the ReadBuffer that
> things are a bit trickier.    The current version of the patch did polling for that case
> but that drew criticism,    and so an imminent new version of the patch
> uses the sigevent mechanism.    And there are other ways still.

I'm a bit puzzled by this though. Postgres *already* has code for this
case. When you call ReadBuffer you set the bits on the buffer
indicating I/O is in progress. If another backend does ReadBuffer for
the same block they'll get the same buffer and then wait until the
first backend's I/O completes. ReadBuffer goes through some hoops to
handle this (and all the corner cases such as the other backend's I/O
completing and the buffer being reused for another block before the
first backend reawakens). It would be a shame to reinvent the wheel.

The problem with using the Buffers I/O in progress bit is that the I/O
might complete while the other backend is busy doing stuff. As long as
you can handle the I/O completion promptly -- either in callback or
thread or signal handler then that wouldn't matter. But I'm not clear
that any of those will work reliably.

-- 
greg



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:

----------------------------------------
> From: stark@mit.edu
> Date: Mon, 23 Jun 2014 16:04:50 -0700
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
> To: johnlumby@hotmail.com
> CC: klaussfreire@gmail.com; hlinnakangas@vmware.com; pgsql-hackers@postgresql.org
>
> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote:
>> It is when some *other* backend gets there first with the ReadBuffer that
>> things are a bit trickier. The current version of the patch did polling for that case
>> but that drew criticism, and so an imminent new version of the patch
>> uses the sigevent mechanism. And there are other ways still.
>
> I'm a bit puzzled by this though. Postgres *already* has code for this
> case. When you call ReadBuffer you set the bits on the buffer

Good question.     Let me explain.
Yes, postgresql has code for the case where a backend is inside a synchronous
read() or write(), performed from a ReadBuffer(), and some other backend
wants that buffer. Asynchronous aio is initiated not from ReadBuffer
but from PrefetchBuffer,    and performs its aio_read into an allocated,  pinned,
postgresql buffer.    This is entirely different from the synchronous io case.
Why?      Because the issuer of the aio_read (the "originator") is unaware
of this buffer pinned on its behalf,  and is then free to do any other
reading or writing it wishes,   such as more prefetching  or any other operation.
And furthermore,  it may *never* issue a ReadBuffer for the block which it
prefetched.

Therefore,  asynchronous IO is different from synchronous IO,  and
a new bit,  BM_AIO_IN_PROGRESS, in the buf_header  is required to
track this aio operation until completion.

I would encourage you to read the new
postgresql-prefetching-asyncio.README
in the patch file where this is explained in greater detail.

> indicating I/O is in progress. If another backend does ReadBuffer for
> the same block they'll get the same buffer and then wait until the
> first backend's I/O completes. ReadBuffer goes through some hoops to
> handle this (and all the corner cases such as the other backend's I/O
> completing and the buffer being reused for another block before the
> first backend reawakens). It would be a shame to reinvent the wheel.

No re-invention!   Actually some effort has been made to use the
existing functions in bufmgr.c as much as possible rather than
rewriting them.

>
> The problem with using the Buffers I/O in progress bit is that the I/O
> might complete while the other backend is busy doing stuff. As long as
> you can handle the I/O completion promptly -- either in callback or
> thread or signal handler then that wouldn't matter. But I'm not clear
> that any of those will work reliably.

They both work reliably,  but the criticism was that backend B polling
an aiocb of an aio issued by backend A is not documented as
being supported  (although it happens to work),  hence the proposed
change to use sigevent.

By the way,   on the "will it actually work though?" question which several folks
have raised,    I should mention that this patch has been in semi-production
use for almost 2 years now in different stages of completion on all postgresql
releases from 9.1.4 to 9.5 devel.       I would guess it has had around
500 hours of operation by now.     I'm sure there are bugs still to be
found but I am confident it is fundamentally sound.
 
>
> --
> greg


Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 06/24/2014 04:29 PM, John Lumby wrote:
>> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote:
>>> It is when some *other* backend gets there first with the ReadBuffer that
>>> things are a bit trickier. The current version of the patch did polling for that case
>>> but that drew criticism, and so an imminent new version of the patch
>>> uses the sigevent mechanism. And there are other ways still.
>>
>> I'm a bit puzzled by this though. Postgres *already* has code for this
>> case. When you call ReadBuffer you set the bits on the buffer
>
> Good question.     Let me explain.
> Yes, postgresql has code for the case of a backend is inside a synchronous
> read() or write(),  performed from a ReadBuffer(),  and some other backend
> wants that buffer.    asynchronous aio is initiated not from ReadBuffer
> but from PrefetchBuffer,    and performs its aio_read into an allocated,  pinned,
> postgresql buffer.    This is entirely different from the synchronous io case.
> Why?      Because the issuer of the aio_read (the "originator") is unaware
> of this buffer pinned on its behalf,  and is then free to do any other
> reading or writing it wishes,   such as more prefetching  or any other operation.
> And furthermore,  it may *never* issue a ReadBuffer for the block which it
> prefetched.

I still don't see the difference. Once an asynchronous read is initiated 
on the buffer, it can't be used for anything else until the read has 
finished. This is exactly the same situation as with a synchronous read: 
after read() is called, the buffer can't be used for anything else until 
the call finishes.

In particular, consider the situation from another backend's point of 
view. Looking from another backend (i.e. one that didn't initiate the 
read), there's no difference between a synchronous and asynchronous 
read. So why do we need a different IPC mechanism for the synchronous 
and asynchronous cases? We don't.

I understand that *within the backend*, you need to somehow track the 
I/O, and you'll need to treat synchronous and asynchronous I/Os 
differently. But that's all within the same backend, and doesn't need to 
involve the flags or locks in shared memory at all. The inter-process 
communication doesn't need any changes.

>> The problem with using the Buffers I/O in progress bit is that the I/O
>> might complete while the other backend is busy doing stuff. As long as
>> you can handle the I/O completion promptly -- either in callback or
>> thread or signal handler then that wouldn't matter. But I'm not clear
>> that any of those will work reliably.
>
> They both work reliably,  but the criticism was that backend B polling
> an aiocb of an aio issued by backend A is not documented as
> being supported  (although it happens to work),  hence the proposed
> change to use sigevent.

You didn't understand what Greg meant. You need to handle the completion 
of the I/O in the same process that initiated it, by clearing the 
in-progress bit of the buffer and releasing the I/O in-progress lwlock 
on it. And you need to do that very quickly after the I/O has finished, 
because there might be another backend waiting for the buffer and you 
don't want him to wait longer than necessary.

The question is, if you receive the notification of the I/O completion 
using a signal or a thread, is it safe to release the lwlock from the 
signal handler or a separate thread?

> By the way,   on the "will it actually work though?" question which several folks
> have raised,    I should mention that this patch has been in semi-production
> use for almost 2 years now in different stages of completion on all postgresql
> releases from 9.1.4 to 9.5 devel.       I would guess it has had around
> 500 hours of operation by now.     I'm sure there are bugs still to be
> found but I am confident it is fundamentally sound.

Well, a committable version of this patch is going to look quite 
different from the first version that you posted, so I don't put much 
weight on how long you've tested the first version.

- Heikki



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:
Thanks Heikki,

----------------------------------------
> Date: Tue, 24 Jun 2014 17:02:38 +0300
> From: hlinnakangas@vmware.com
> To: johnlumby@hotmail.com; stark@mit.edu
> CC: klaussfreire@gmail.com; pgsql-hackers@postgresql.org
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
>
> On 06/24/2014 04:29 PM, John Lumby wrote:
>>> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote:
>>>> It is when some *other* backend gets there first with the ReadBuffer that
>>>> things are a bit trickier. The current version of the patch did polling for that case
>>>> but that drew criticism, and so an imminent new version of the patch
>>>> uses the sigevent mechanism. And there are other ways still.
>>>
>>> I'm a bit puzzled by this though. Postgres *already* has code for this
>>> case. When you call ReadBuffer you set the bits on the buffer
>>
>> Good question. Let me explain.
>> Yes, postgresql has code for the case of a backend is inside a synchronous
>> read() or write(), performed from a ReadBuffer(), and some other backend
>> wants that buffer. asynchronous aio is initiated not from ReadBuffer
>> but from PrefetchBuffer, and performs its aio_read into an allocated, pinned,
>> postgresql buffer. This is entirely different from the synchronous io case.
>> Why? Because the issuer of the aio_read (the "originator") is unaware
>> of this buffer pinned on its behalf, and is then free to do any other
>> reading or writing it wishes, such as more prefetching or any other operation.
>> And furthermore, it may *never* issue a ReadBuffer for the block which it
>> prefetched.
>
> I still don't see the difference. Once an asynchronous read is initiated
> on the buffer, it can't be used for anything else until the read has
> finished. This is exactly the same situation as with a synchronous read:
> after read() is called, the buffer can't be used for anything else until
> the call finishes.

Ah, now I see what you and Greg are asking. See my next comment embedded below.

>
> In particular, consider the situation from another backend's point of
> view. Looking from another backend (i.e. one that didn't initiate the
> read), there's no difference between a synchronous and asynchronous
> read. So why do we need a different IPC mechanism for the synchronous
> and asynchronous cases? We don't.
>
> I understand that *within the backend*, you need to somehow track the
> I/O, and you'll need to treat synchronous and asynchronous I/Os
> differently. But that's all within the same backend, and doesn't need to
> involve the flags or locks in shared memory at all. The inter-process
> communication doesn't need any changes.
>
>>> The problem with using the Buffers I/O in progress bit is that the I/O
>>> might complete while the other backend is busy doing stuff. As long as
>>> you can handle the I/O completion promptly -- either in callback or
>>> thread or signal handler then that wouldn't matter. But I'm not clear
>>> that any of those will work reliably.
>>
>> They both work reliably, but the criticism was that backend B polling
>> an aiocb of an aio issued by backend A is not documented as
>> being supported (although it happens to work), hence the proposed
>> change to use sigevent.
>
> You didn't understand what Greg meant. You need to handle the completion
> of the I/O in the same process that initiated it, by clearing the
> in-progress bit of the buffer and releasing the I/O in-progress lwlock
> on it. And you need to do that very quickly after the I/O has finished,
> because there might be another backend waiting for the buffer and you
> don't want him to wait longer than necessary.

I think I understand the question now.    I didn't spell out the details earlier.
Let me explain a little more.
With this patch, when a read is issued, it is either a synchronous IO
(as before),  or an asynchronous aio_read (new,   represented by
both BM_IO_IN_PROGRESS    *and*  BM_AIO_IN_PROGRESS).
The way other backends wait on a synchronous IO in progress is unchanged.
But if BM_AIO_IN_PROGRESS,   then *any*  backend which requests
ReadBuffer on this block (including originator) follows a new path
through BufCheckAsync() which,  depending on various flags and context,
sends the backend down to FileCompleteaio to check and maybe wait.
So *all* backends who are waiting for a BM_AIO_IN_PROGRESS buffer
will wait in that way.    
 
>
> The question is, if you receive the notification of the I/O completion
> using a signal or a thread, is it safe to release the lwlock from the
> signal handler or a separate thread?

In the forthcoming  new version of the patch that uses sigevent,
the originator locks a LWlock associated with that BAaiocb eXclusive,
and ,   when signalled,  in the signal handler it places that LWlock
on a process-local queue of LWlocks awaiting release.
(No, It cannot be safely released inside the signal handler or in a
separate thread).     Whenever the mainline passes a CHECK_INTERRUPTS macro
and at a few additional points in bufmgr,  the backend walks this process-local
queue and releases those LWlocks.    This is also done if the originator
itself issues a ReadBuffer,  which is the most frequent case in which it
is released.

Meanwhile, any other backend will simply acquire the LWlock Shared and release it.
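
A minimal sketch of that deferred-release scheme, assuming the sigevent
is configured to deliver SIGRTMIN with the lock pointer in si_value;
the names and the fixed-size queue are illustrative, not the patch's:

    #include <signal.h>

    #define MAX_PENDING 64

    /* process-local queue of locks flagged for release by the handler */
    static volatile sig_atomic_t npending = 0;
    static void *pending_release[MAX_PENDING];  /* LWLock pointers */

    /* signal handler: only record the lock -- never touch spinlocks here */
    static void
    aio_complete_handler(int signo, siginfo_t *info, void *ctx)
    {
        (void) signo; (void) ctx;
        if (npending < MAX_PENDING)
            pending_release[npending++] = info->si_value.sival_ptr;
    }

    /* called from the mainline, e.g. when CHECK_FOR_INTERRUPTS is passed */
    static void
    drain_pending_releases(void)
    {
        sigset_t block, orig;

        sigemptyset(&block);
        sigaddset(&block, SIGRTMIN);
        sigprocmask(SIG_BLOCK, &block, &orig);  /* keep the handler out */
        while (npending > 0)
        {
            void *lock = pending_release[--npending];
            /* LWLockRelease(lock) would go here, safely in the mainline */
            (void) lock;
        }
        sigprocmask(SIG_SETMASK, &orig, NULL);
    }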

I think you are right that the existing io_in_progress_lock LWlock in the
buf_header could be used for this, because if there is an aio in progress,
then that lock cannot be in use for synchronous IO.  I chose not to use it
because I preferred to keep the wait/post for asynch io separate,
 but they could both use the same LWlock.   However,   the way the LWlock
is acquired and released would still be a bit different because of the need
to have the originator release it in its mainline.

>
>> By the way, on the "will it actually work though?" question which several folks
>> have raised, I should mention that this patch has been in semi-production
>> use for almost 2 years now in different stages of completion on all postgresql
>> releases from 9.1.4 to 9.5 devel. I would guess it has had around
>> 500 hours of operation by now. I'm sure there are bugs still to be
>> found but I am confident it is fundamentally sound.
>
> Well, a committable version of this patch is going to look quite
> different from the first version that you posted, so I don't put much
> weight on how long you've tested the first version.

Yes, I am quite willing to change it, time permitting.
I take the words "committable version" as a positive sign ...


>
> - Heikki


Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 06/24/2014 06:08 PM, John Lumby wrote:
>> The question is, if you receive the notification of the I/O completion
>> using a signal or a thread, is it safe to release the lwlock from the
>> signal handler or a separate thread?
>
> In the forthcoming  new version of the patch that uses sigevent,
> the originator locks a LWlock associated with that BAaiocb eXclusive,
> and ,   when signalled,  in the signal handler it places that LWlock
> on a process-local queue of LWlocks awaiting release.
> (No, It cannot be safely released inside the signal handler or in a
> separate thread).     Whenever the mainline passes a CHECK_INTERRUPTS macro
> and at a few additional points in bufmgr,  the backend walks this process-local
> queue and releases those LWlocks.    This is also done if the originator
> itself issues a ReadBuffer,  which is the most frequent case in which it
> is released.
>
> Meanwhile,  any other backend will simply acquire Shared and release.

Ok, doing the work in CHECK_FOR_INTERRUPTS sounds safe. But is that fast 
enough? We have never made any hard guarantees on how often 
CHECK_FOR_INTERRUPTS() is called. In particular, if you're running 3rd 
party C code or PL code, there might be no CHECK_FOR_INTERRUPTS() calls 
for many seconds, or even more. That's a long time to hold onto a buffer 
I/O lock. I don't think that's acceptable :-(.

> I think you are right that the existing io_in_progress_lock LWlock in the
> buf_header  could be used for this,  because if there is a aio in progress,
> then that lock cannot be in use for synchronous IO.  I chose not to use it
> because I preferred to keep the wait/post for asynch io separate,
>   but they could both use the same LWlock.   However,   the way the LWlock
> is acquired and released would still be a bit different because of the need
> to have the originator release it in its mainline.

It would be nice to use the same LWLock.

However, if releasing a regular LWLock in a signal handler is not safe, 
and cannot be made safe, perhaps we should, after all, invent a whole 
new mechanism. One that would make it safe to release the lock in a 
signal handler.

>>> By the way, on the "will it actually work though?" question which several folks
>>> have raised, I should mention that this patch has been in semi-production
>>> use for almost 2 years now in different stages of completion on all postgresql
>>> releases from 9.1.4 to 9.5 devel. I would guess it has had around
>>> 500 hours of operation by now. I'm sure there are bugs still to be
>>> found but I am confident it is fundamentally sound.
>>
>> Well, a committable version of this patch is going to look quite
>> different from the first version that you posted, so I don't put much
>> weight on how long you've tested the first version.
>
> Yes,  I am quite willing to change it,  time permitting.
> I take the works "committable version" as a positive sign ...

BTW, sorry if I sound negative, I'm actually quite excited about this 
feature. A patch like this takes a lot of work, and usually several 
rewrites, until it's ready ;-). But I'm looking forward to it.

- Heikki




Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Tue, Jun 24, 2014 at 12:08 PM, John Lumby <johnlumby@hotmail.com> wrote:
>> The question is, if you receive the notification of the I/O completion
>> using a signal or a thread, is it safe to release the lwlock from the
>> signal handler or a separate thread?
>
> In the forthcoming  new version of the patch that uses sigevent,
> the originator locks a LWlock associated with that BAaiocb eXclusive,
> and ,   when signalled,  in the signal handler it places that LWlock
> on a process-local queue of LWlocks awaiting release.
> (No, It cannot be safely released inside the signal handler or in a
> separate thread).     Whenever the mainline passes a CHECK_INTERRUPTS macro
> and at a few additional points in bufmgr,  the backend walks this process-local
> queue and releases those LWlocks.    This is also done if the originator
> itself issues a ReadBuffer,  which is the most frequent case in which it
> is released.

I suggest using a semaphore instead.

Semaphores are supposed to be incremented/decremented from multiple
threads or processes already. So, in theory, the callback itself
should be able to do it.

The problem with the process-local queue is that it may take time to
be processed (the time it takes to get to a CHECK_INTERRUPTS macro,
which, as happened with regexes, can be quite high).



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 06/25/2014 09:20 PM, Claudio Freire wrote:
> On Tue, Jun 24, 2014 at 12:08 PM, John Lumby <johnlumby@hotmail.com> wrote:
>>> The question is, if you receive the notification of the I/O completion
>>> using a signal or a thread, is it safe to release the lwlock from the
>>> signal handler or a separate thread?
>>
>> In the forthcoming  new version of the patch that uses sigevent,
>> the originator locks a LWlock associated with that BAaiocb eXclusive,
>> and ,   when signalled,  in the signal handler it places that LWlock
>> on a process-local queue of LWlocks awaiting release.
>> (No, It cannot be safely released inside the signal handler or in a
>> separate thread).     Whenever the mainline passes a CHECK_INTERRUPTS macro
>> and at a few additional points in bufmgr,  the backend walks this process-local
>> queue and releases those LWlocks.    This is also done if the originator
>> itself issues a ReadBuffer,  which is the most frequent case in which it
>> is released.
>
> I suggest using a semaphore instead.
>
> Semaphores are supposed to be incremented/decremented from multiple
> threads or processes already. So, in theory, the callback itself
> should be able to do it.

LWLocks are implemented with semaphores, so if you can increment a 
semaphore in the signal handler / callback thread, then in theory you 
should be able to release a LWLock. You'll need some additional 
synchronization within the same process, to make sure you don't release 
a lock in a signal handler while you're just doing the same thing in the 
main thread. I'm not sure I want to buy into the notion that 
LWLockRelease must be generally safe to call from a signal handler, but 
maybe it's possible to have a variant of it that is.

On Linux at least we use System V semaphores, which are (unsurprisingly) 
not listed as safe for using in a signal handler. See list 
Async-signal-safe functions in signal(7) man page. The function used to 
increment a POSIX semaphore, sem_post(), is in the list, however.
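
For illustration, that pattern sketched with a process-shared POSIX
semaphore; the PrefetchReq struct and names are hypothetical, not the
patch's:

    #include <aio.h>
    #include <semaphore.h>
    #include <signal.h>

    /* One per in-flight prefetch, living in shared memory; 'done' is
     * initialized process-shared with sem_init(&req->done, 1, 0). */
    typedef struct
    {
        struct aiocb acb;
        sem_t        done;
    } PrefetchReq;

    /* completion callback (or signal handler): sem_post() is on the
     * async-signal-safe list in signal(7), so it is legal here */
    static void
    on_complete(union sigval sv)
    {
        PrefetchReq *req = sv.sival_ptr;

        sem_post(&req->done);
    }

    /* any backend waiting on the request */
    static void
    wait_done(PrefetchReq *req)
    {
        while (sem_wait(&req->done) != 0)
            ;                   /* retry if interrupted (EINTR) */
        sem_post(&req->done);   /* re-post so further waiters also wake */
    }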

- Heikki




Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Andres Freund
Date:
On 2014-06-26 00:08:48 +0300, Heikki Linnakangas wrote:
> LWLocks are implemented with semaphores, so if you can increment a semaphore
> in the signal handler / callback thread, then in theory you should be able
> to release a LWLock.

I don't think that's a convincing argument even if semop et al were
signal safe. LWLocks also use spinlocks and it's a fairly bad idea to do
anything with the same spinlock both inside and outside a signal
handler.
I don't think we're going to get lwlock.c/LWLockRelease to work
reasonably from a signal handler without a lot of work.

> On Linux at least we use System V semaphores, which are (unsurpisingly) not
> listed as safe for using in a signal handler. See list Async-signal-safe
> functions in signal(7) man page. The function used to increment a POSIX
> semaphore, sem_post(), is in the list, however.

Heh, just wrote the same after reading the beginning of your email ;)

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:
I am attaching the new version of the patch with support for use of sigevent
to report completion of asynch io and the new LWlock for waiters to
wait for originator to release it.

This patch is based on up-to-date git head but the asyncio design is
a bit behind the curve of recent discussions here.   Specifically,
sigevent is still optional (see README for how to configure it)
and non-originators still refer to the originator's aiocb to check the
completion code  (but not for waiting for it to complete).

All sigevent-related code is marked by
#ifdef USE_AIO_SIGEVENT
to make it easier to find.

Also, although it "works", it is not functionally complete -
search for FIXME in
src/backend/storage/buffer/buf_async.c -
and it has not had nearly as much testing.

________________________________________________________________________

I will try to paste earlier points and embed comments:


Heikki wrote :

> Ok, doing the work in CHECK_FOR_INTERRUPTS sounds safe. But is that fast
> enough? We have never made any hard guarantees on how often
> CHECK_FOR_INTERRUPTS() is called. In particular, if you're running 3rd
> party C code or PL code, there might be no CHECK_FOR_INTERRUPTS() calls
> for many seconds, or even more. That's a long time to hold onto a buffer
> I/O lock. I don't think that's acceptable :-(.

True, but remember this case is the very infrequent one, less than 0.2%
(in my benchmark).

>
>> I think you are right that the existing io_in_progress_lock LWlock in the
>> buf_header  could be used for this,  because if there is a aio in progress,
>> then that lock cannot be in use for synchronous IO.  I chose not to use it
>> because I preferred to keep the wait/post for asynch io separate,
>>   but they could both use the same LWlock.   However,   the way the LWlock
>> is acquired and released would still be a bit different because of the need
>> to have the originator release it in its mainline.
>
> It would be nice to use the same LWLock.

I think that will work.

>
> However, if releasing a regular LWLock in a signal handler is not safe,
> and cannot be made safe, perhaps we should, after all, invent a whole
> new mechanism. One that would make it safe to release the lock in a
> signal handler.

It would take rewriting lwlock.c and still be devilishly hard to test -
I would prefer not to.

>
 >>
>>> Well, a committable version of this patch is going to look quite
>>> different from the first version that you posted, so I don't put much
>>> weight on how long you've tested the first version.
>>
>> Yes,  I am quite willing to change it,  time permitting.
>> I take the works "committable version" as a positive sign ...
>
> BTW, sorry if I sound negative, I'm actually quite excited about this
> feature. A patch like this take a lot of work, and usually several
> rewrites, until it's ready ;-). But I'm looking forward for it.

I am grateful you spent the time to study it.
Knowing the code, I am not surprised it made you a bit grumpy ...


________________________________________________________

Discussion about the difference between a synchronous read and an
asynchronous read, as far as a non-originator waiting on it is concerned:

I thought a bit more about this.   There are currently two differences,
one of which can easily be changed and one not so easy.

1)     The current code, even with sigevent, still makes the non-originator waiter
         call aio_error on the originator's aiocb to get the completion code.
         For the sigevent variation, this is easily changed to have the originator always call aio_error
         (from its CHECK_INTERRUPTS or from FileCompleteaio)
         and store that in the BAiocb.
         My reason for not doing that was that, by having the non-originator check the aiocb,
         the waiter could proceed sooner.   But for a different reason it actually
         doesn't.   (The non-originator must still wait for the LWlock release.)

  2)   Buffer pinning and  returning the BufferAiocb to the free list
        With synchronous IO,    each backend that calls a ReadBuffer must pin the buffer
        early in the process.
        With asynchronous IO,    initially only the originator gets the pin
        (and that is during PrefetchBuffer,  not Readbuffer)
         When the aio completes and some backend checks that completion,
        then the backend has various responsibilities:

               .   pin the buffer if it did not already have one (from prefetch)
               .  if it was the last such backend to make that check
                  (amongst the cohort waiting on it)
then return the BufferAiocb to the free list

        Those functions are different from the synchronous IO case.
        I think code clarity alone may dictate keeping these separate.

___________________________________________________________________
various discussion of semaphores and LWlocks:


----------------------------------------
> Date: Wed, 25 Jun 2014 23:17:53 +0200
> From: andres@2ndquadrant.com
> To: hlinnakangas@vmware.com
> CC: klaussfreire@gmail.com; johnlumby@hotmail.com; stark@mit.edu; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>
> On 2014-06-26 00:08:48 +0300, Heikki Linnakangas wrote:
>> LWLocks are implemented with semaphores, so if you can increment a semaphore
>> in the signal handler / callback thread, then in theory you should be able
>> to release a LWLock.
>
> I don't think that's a convincing argument even if semop et al were
> signal safe. LWLocks also use spinlocks and it's a fairly bad idea to do
> anything with the same spinlock both inside and outside a signal
> handler.
> I don't think we're going to get lwlock.c/LWLockRelease to work
> reasonably from a signal handler without a lot of work.

I agree -  see earlier.




Attachment


Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
John Lumby
Date:
I am attaching a new version of the patch for consideration in the current commit fest.

Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190@phx.gbl
the method for handling aio completion using sigevent has been re-written to use
signals exclusively rather than a composite of signals and LWlocks,
and this has fixed the problem I mentioned before with the LWlock method.
More details are in postgresql-prefetching-asyncio.README
(search for "Checking AIO Completion").

I also have worked my benchmark database and one application into a publishable state.
However,   the database is in the form of a compressed pg_dump and is around
218 MB in size.    I was hoping to come up with a generation program to load it
but have not had the time for that.     Is there some place on postgresql.org for
such a large file?   If not I will try to think of some place for it.

John Lumby 




Attachment

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 08/20/2014 12:17 AM, John Lumby wrote:
> I am attaching a new version of the patch for consideration in the current commit fest.

Thanks for working on this!

> Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190@phx.gbl
> the method for handling aio completion using sigevent has been re-written to use
> signals exclusively rather than a composite of signals and LWlocks,
> and this has fixed the problem I mentioned before with the LWlock method.

ISTM the patch is still allocating stuff in shared memory that really 
doesn't belong there. Namely, the BufferAiocb structs. Or at least parts 
of it; there's also a waiter queue there which probably needs to live in 
shared memory, but the rest of it does not.

At least BufAWaitAioCompletion is still calling aio_error() on an AIO 
request that might've been initiated by another backend. That's not OK.

Please write the patch without atomic CAS operation. Just use a 
spinlock. There's a patch in the commitfest to add support for that, but 
it's not committed yet, and all those USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 
ifdefs make the patch more difficult to read. Same with all the other 
#ifdefs; please just pick a method that works.
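
To make that trade-off concrete, here is the shape of a shared free-list
push done both ways; the names are illustrative, and a pthread spinlock
stands in for the backend's slock_t:

    #include <pthread.h>

    typedef struct Node
    {
        struct Node *next;
    } Node;

    static Node *free_head;     /* head of a LIFO list in shared memory */

    /* lock-free push using the gcc builtin the patch relies on */
    static void
    push_cas(Node *n)
    {
        Node *old;

        do
        {
            old = free_head;
            n->next = old;
        } while (!__sync_bool_compare_and_swap(&free_head, old, n));
    }

    /* spinlock alternative; free_lock initialized elsewhere with
     * pthread_spin_init(&free_lock, PTHREAD_PROCESS_SHARED) */
    static pthread_spinlock_t free_lock;

    static void
    push_spin(Node *n)
    {
        pthread_spin_lock(&free_lock);
        n->next = free_head;
        free_head = n;
        pthread_spin_unlock(&free_lock);
    }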

Also, please split prefetching of regular index scans into a separate 
patch. It's orthogonal to doing async I/O; we could prefetch regular 
index scans with posix_fadvise too, and AIO should be useful for 
prefetching other stuff.

- Heikki




Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Claudio Freire
Date:
On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Also, please split prefetching of regular index scans into a separate patch.
> It's orthogonal to doing async I/O; we could prefetch regular index scans
> with posix_fadvise too, and AIO should be useful for prefetching other
> stuff.

That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.

I don't have an archive link at hand atm, but I could produce one later.



Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
johnlumby
Date:
Thanks for the replies and thoughts.

On 08/19/14 18:27, Heikki Linnakangas wrote:
> On 08/20/2014 12:17 AM, John Lumby wrote:
>> I am attaching a new version of the patch for consideration in the current commit fest.
>
> Thanks for working on this!
>
>> Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190@phx.gbl
>> the method for handling aio completion using sigevent has been re-written to use
>> signals exclusively rather than a composite of signals and LWlocks,
>> and this has fixed the problem I mentioned before with the LWlock method.
>
> ISTM the patch is still allocating stuff in shared memory that really
> doesn't belong there. Namely, the BufferAiocb structs. Or at least parts
> of it; there's also a waiter queue there which probably needs to live in
> shared memory, but the rest of it does not.

Actually the reason the BufferAiocb (the postgresql block corresponding to the aio's aiocb)
must be located in shared memory is that, as you said, it acts as anchor for the waiter list.
See further comment below.

> At least BufAWaitAioCompletion is still calling aio_error() on an AIO
> request that might've been initiated by another backend. That's not OK.

Yes, you are right, and I agree with this one -
I will add an aio_error_return_code field in the BufferAiocb
and only the originator will set this from the real aiocb.

> Please write the patch without atomic CAS operation. Just use a spinlock.

Umm, this is a new criticism I think. I use CAS for things other than locking,
such as add/remove from shared queue. I suppose maybe a spinlock on the entire queue
can be used equivalently, but with more code (extra confusion) and worse performance
(coarser serialization). What is your objection to using gcc's atomic ops? Portability?

> There's a patch in the commitfest to add support for that,

sorry, support for what? There are already spinlocks in postgresql,
you mean some new kind? please point me at it with hacker msgid or something.

> but it's not committed yet, and all those USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
> ifdefs make the patch more difficult to read. Same with all the other
> #ifdefs; please just pick a method that works.

Ok, yes, the ifdefs are unpleasant. I will do something about that.
Ideally they would be entirely confined to header files and only macro functions
in C files - maybe I can do that. And eventually when the dust has settled
eliminate obsolete ifdef blocks altogether.

> Also, please split prefetching of regular index scans into a separate
> patch. It's orthogonal to doing async I/O;

actually not completely orthogonal, see next

> we could prefetch regular index scans with posix_fadvise too, and AIO
> should be useful for prefetching other stuff.

On 08/19/14 19:10, Claudio Freire wrote:
> On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Also, please split prefetching of regular index scans into a separate patch.
>> ...
> That patch already happened on the list, and it wasn't a win in many
> cases. I'm not sure it should be proposed independently of this one.
> Maybe a separate patch, but it should be considered dependent on this.
>
> I don't have an archive link at hand atm, but I could produce one later.

Several people have asked to split this patch into several smaller ones and I
have thought about it. It would introduce some awkward dependencies.
E.g. splitting the scanner code (index, relation-heap) into a separate patch
from the aio code would mean that the scanner patch becomes dependent
on the aio patch. They are not quite orthogonal.

The reason is that the scanners call a new function, DiscardBuffer(blockid),
when aio is in use. We can get around it by providing a stub for that function
in the scanner patch, but then there is some risk of someone getting the
wrong version of that function in their build. It just adds yet more complexity
and breakage opportunities.

One further comment concerning these BufferAiocb and aiocb control blocks
being in shared memory:

I explained above that the BufferAiocb must be in shared memory for wait/post.
The aiocb does not necessarily have to be in shared memory,
but since there is a one-to-one relationship between BufferAiocb and aiocb,
it makes the code much simpler, and the operation much more efficient,
if the aiocb is embedded in the BufferAiocb as I have it now.
E.g. if the aiocb is in private-process memory, then an additional allocation
scheme is needed (fixed number? palloc()in'g extra ones as needed? ...)
which adds complexity, and probably some inefficiency since a shared pool is usually
more efficient (allows higher maximum per process etc), and there would have to be
some pointer de-referencing from BufferAiocb to aiocb adding some (small) overhead.

I understood your objection to use of shared memory as being that you don't want
a non-originator to access the originator's aiocb using aio_xxx calls, and, with the
one change I've said I will do above (put the aio_error retcode in the BufferAiocb)
I will have achieved that requirement. I am hoping this answers your objections
concerning shared memory.

Re: Extended Prefetching using Asynchronous IO - proposal and patch

From
Heikki Linnakangas
Date:
On 08/25/2014 12:49 AM, johnlumby wrote:
> On 08/19/14 18:27, Heikki Linnakangas wrote:
>> Please write the patch without atomic CAS operation. Just use a spinlock.
>
> Umm,   this is a new criticism I think.

Yeah. Be prepared that new issues will crop up as the patch gets slimmer 
and easier to review :-). Right now there's still so much chaff that 
it's difficult to see the wheat.

>   I use CAS for things other
> than locking,
> such as add/remove from shared queue.   I suppose maybe a spinlock on
> the entire queue
> can be used equivalently,  but with more code (extra confusion) and
> worse performance
> (coarser serialization).      What is your objection to using gcc's
> atomic ops?   Portability?

Yeah, portability.

Atomic ops might make sense, but at this stage it's important to slim 
down the patch to the bare minimum, so that it's easier to review. Once 
the initial patch is in, you can improve it with additional patches.

>> There's a patch in the commitfest to add support for that,
>
> sorry,  support for what? There are already spinlocks in postgresql,
> you mean some new kind?    please point me at it with hacker msgid or
> something.

Atomic ops: https://commitfest.postgresql.org/action/patch_view?id=1314

Once that's committed, you can use the new atomic ops in your patch. But 
until then, stick to spinlocks.

> On 08/19/14 19:10, Claudio Freire wrote:
>> On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> Also, please split prefetching of regular index scans into a separate patch. ...
>> That patch already happened on the list, and it wasn't a win in many
>> cases. I'm not sure it should be proposed independently of this one.
>> Maybe a separate patch, but it should be considered dependent on this.
>>
>> I don't have an archive link at hand atm, but I could produce one later.
> Several people have asked to split this patch into several smaller ones
> and I
> have thought about it.     It would introduce some awkward dependencies.
> E.g.  splitting the scanner code  (index,  relation-heap) into separate
> patch
> from aio code would mean that the scanner patch becomes dependent
> on the aio patch.     They are not quite orthogonal.

Right now, please focus on the main AIO patch. That should show a 
benefit for bitmap heap scans too, so to demonstrate the benefits of 
AIO, you don't need to prefetch regular index scans. The main AIO patch 
can be written, performance tested, and reviewed without caring about 
regular index scans at all.

> The reason is that the scanners call a new function, DiscardBuffer(blockid)
> when aio is in use.     We can get around it by providing a stub for
> that function
> in the scanner patch,   but then there is some risk of someone getting the
> wrong version of that function in their build.     It just adds yet more
> complexity
> and breakage opportunities.

Regardless of the regular index scans, we'll need to discuss the new API 
of PrefetchBuffer and DiscardBuffer.

It would be simpler for the callers if you didn't require the 
DiscardBuffer calls. I think it would be totally feasible to write the 
patch that way. Just drop the buffer pin after the asynchronous read 
finishes. When the caller actually needs the page, it will call 
ReadBuffer which will pin it again. You'll get a little bit more bufmgr 
traffic that way, but I think it'll be fine in practice.
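
A caller-side sketch of the two contracts under discussion -- backend
code, so it compiles only inside the server; DiscardBuffer is the
patch's new call, the rest is existing bufmgr API:

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /* Illustrative scan step: prefetch, then either consume or abandon. */
    static void
    scan_one_block(Relation rel, BlockNumber blkno, bool will_read_it)
    {
        PrefetchBuffer(rel, MAIN_FORKNUM, blkno);  /* starts the aio_read */

        if (will_read_it)
        {
            Buffer buf = ReadBuffer(rel, blkno);   /* pins; waits if the
                                                    * read is in flight */
            /* ... use the page ... */
            ReleaseBuffer(buf);
        }
        else
            DiscardBuffer(blkno);   /* patch's call: drop the prefetch pin.
                                     * Under the suggestion above this call
                                     * disappears, the pin being dropped when
                                     * the aio completes. */
    }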

> One further comment concerning these BufferAiocb and aiocb control blocks
> being in shared memory :
>
> I explained above that the BufferAiocb must be in shared memory for
> wait/post.
> The aiocb does not necessarily have to be in shared memory,
> but since there is a one-to-one relationship between BufferAiocb and aiocb,
> it makes the code much simpler ,  and the operation much more efficient,
> if the aiocb is embedded in the BufferAiocb as I have it now.
> E.g. if the aiocb is in private-process memory,   then an additional
> allocation
> scheme is needed (fixed number?   palloc()in'g extra ones as needed?  ...)
> which adds complexity,

Yep, use palloc or a fixed pool. There's nothing wrong with that.

>  and probably some inefficiency since a shared
> pool is usually
> more efficient (allows higher maximum per process etc),   and there
> would have to be
> some pointer de-referencing from BufferAiocb to aiocb adding some
> (small) overhead.

I think you're falling into the domain of premature optimization. A few 
pointer dereferences are totally insignificant, and the amount of memory 
you're saving pales in comparison to other similar non-shared pools and 
caches we have (catalog caches, for example). And on the other side of 
the coin, with a shared pool you'll waste memory when async I/O is not 
used (e.g because everything fits in RAM), and when it is used, you'll 
have more contention on locks and cache lines when multiple processes 
use the same objects.

The general guideline in PostgreSQL is that everything is 
backend-private, except structures used to communicate between backends.

- Heikki