Re: Extended Prefetching using Asynchronous IO - proposal and patch - Mailing list pgsql-hackers

From John Lumby
Subject Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date
Msg-id BAY175-W526880A96F1B36CD529D4BA31E0@phx.gbl
In response to Re: Extended Prefetching using Asynchronous IO - proposal and patch  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Extended Prefetching using Asynchronous IO - proposal and patch  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: Extended Prefetching using Asynchronous IO - proposal and patch  (Claudio Freire <klaussfreire@gmail.com>)
List pgsql-hackers
Thanks Heikki,

----------------------------------------
> Date: Tue, 24 Jun 2014 17:02:38 +0300
> From: hlinnakangas@vmware.com
> To: johnlumby@hotmail.com; stark@mit.edu
> CC: klaussfreire@gmail.com; pgsql-hackers@postgresql.org
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
>
> On 06/24/2014 04:29 PM, John Lumby wrote:
>>> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote:
>>>> It is when some *other* backend gets there first with the ReadBuffer that
>>>> things are a bit trickier. The current version of the patch polled in that case,
>>>> but that drew criticism, and so an imminent new version of the patch
>>>> uses the sigevent mechanism. And there are other ways still.
>>>
>>> I'm a bit puzzled by this though. Postgres *already* has code for this
>>> case. When you call ReadBuffer you set the bits on the buffer
>>
>> Good question. Let me explain.
>> Yes, postgresql has code for the case where a backend is inside a synchronous
>> read() or write(), performed from a ReadBuffer(), and some other backend
>> wants that buffer. Asynchronous IO, by contrast, is initiated not from ReadBuffer
>> but from PrefetchBuffer, and performs its aio_read into an allocated, pinned
>> postgresql buffer. This is entirely different from the synchronous io case.
>> Why? Because the issuer of the aio_read (the "originator") is unaware
>> of this buffer pinned on its behalf, and is then free to do any other
>> reading or writing it wishes, such as more prefetching or any other operation.
>> And furthermore, it may *never* issue a ReadBuffer for the block which it
>> prefetched.
>
> I still don't see the difference. Once an asynchronous read is initiated
> on the buffer, it can't be used for anything else until the read has
> finished. This is exactly the same situation as with a synchronous read:
> after read() is called, the buffer can't be used for anything else until
> the call finishes.

Ah, now I see what you and Greg are asking.   See my next embedded reply below.
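To make the prefetch side quoted above concrete, here is a rough standalone
sketch of the issuance (much simplified; BufferAiocb, start_buffer_aio and
the choice of SIGUSR2 are illustrative stand-ins of mine, not the patch's
exact code):

    /* Prefetch-side issuance, simplified: PrefetchBuffer has already
     * allocated and pinned the buffer and set BM_AIO_IN_PROGRESS; it
     * then fires the aio_read and returns at once.  The originator may
     * never follow up with a ReadBuffer for this block.
     */
    #include <aio.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/types.h>

    #define BLCKSZ 8192                     /* postgresql block size */

    typedef struct BufferAiocb              /* illustrative BAaiocb */
    {
        struct aiocb    iocb;               /* POSIX aio control block */
        int             buf_id;             /* which shared buffer is pinned */
    } BufferAiocb;

    static int
    start_buffer_aio(BufferAiocb *baiocb, int fd, off_t blocknum, void *bufpage)
    {
        memset(&baiocb->iocb, 0, sizeof(baiocb->iocb));
        baiocb->iocb.aio_fildes = fd;
        baiocb->iocb.aio_buf    = bufpage;  /* the pinned buffer's page */
        baiocb->iocb.aio_nbytes = BLCKSZ;
        baiocb->iocb.aio_offset = blocknum * (off_t) BLCKSZ;

        /* ask for a signal on completion -- the sigevent mechanism */
        baiocb->iocb.aio_sigevent.sigev_notify          = SIGEV_SIGNAL;
        baiocb->iocb.aio_sigevent.sigev_signo           = SIGUSR2;
        baiocb->iocb.aio_sigevent.sigev_value.sival_ptr = baiocb;

        return aio_read(&baiocb->iocb);     /* returns immediately */
    }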

>
> In particular, consider the situation from another backend's point of
> view. Looking from another backend (i.e. one that didn't initiate the
> read), there's no difference between a synchronous and asynchronous
> read. So why do we need a different IPC mechanism for the synchronous
> and asynchronous cases? We don't.
>
> I understand that *within the backend*, you need to somehow track the
> I/O, and you'll need to treat synchronous and asynchronous I/Os
> differently. But that's all within the same backend, and doesn't need to
> involve the flags or locks in shared memory at all. The inter-process
> communication doesn't need any changes.
>
>>> The problem with using the Buffers I/O in progress bit is that the I/O
>>> might complete while the other backend is busy doing stuff. As long as
>>> you can handle the I/O completion promptly -- either in callback or
>>> thread or signal handler then that wouldn't matter. But I'm not clear
>>> that any of those will work reliably.
>>
>> They both work reliably, but the criticism was that backend B polling
>> an aiocb of an aio issued by backend A is not documented as
>> being supported (although it happens to work), hence the proposed
>> change to use sigevent.
>
> You didn't understand what Greg meant. You need to handle the completion
> of the I/O in the same process that initiated it, by clearing the
> in-progress bit of the buffer and releasing the I/O in-progress lwlock
> on it. And you need to do that very quickly after the I/O has finished,
> because there might be another backend waiting for the buffer and you
> don't want him to wait longer than necessary.

I think I understand the question now.   I didn't spell out the details earlier,
so let me explain a little more.
With this patch, when a read is issued, it is either a synchronous IO
(as before), or an asynchronous aio_read (new, represented by
both BM_IO_IN_PROGRESS *and* BM_AIO_IN_PROGRESS).
The way other backends wait on a synchronous IO in progress is unchanged.
But if BM_AIO_IN_PROGRESS is set, then *any* backend which requests
ReadBuffer on this block (including the originator) follows a new path
through BufCheckAsync() which, depending on various flags and context,
sends the backend down to FileCompleteaio to check for completion and maybe wait.
So *all* backends who are waiting for a BM_AIO_IN_PROGRESS buffer
will wait in that way.
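In rough outline, the check-and-maybe-wait step works like this (a simplified
standalone sketch of the FileCompleteaio idea; the function body here is
illustrative, not the patch's exact code):

    /* Check, and optionally wait for, completion of an aio_read started
     * at prefetch time.  The real function also handles flags, pins and
     * the non-originator cases discussed elsewhere in this mail.
     */
    #include <aio.h>
    #include <errno.h>
    #include <stdbool.h>
    #include <sys/types.h>

    static ssize_t
    file_complete_aio(struct aiocb *iocb, bool wait)
    {
        int     err;

        while ((err = aio_error(iocb)) == EINPROGRESS)
        {
            const struct aiocb *list[1];

            if (!wait)
            {
                errno = EINPROGRESS;
                return -1;              /* caller only wanted to test */
            }
            list[0] = iocb;
            (void) aio_suspend(list, 1, NULL);  /* block until complete */
        }

        if (err != 0)
        {
            errno = err;                /* the aio_read itself failed */
            return -1;
        }
        return aio_return(iocb);        /* reap the result exactly once */
    }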
 
>
> The question is, if you receive the notification of the I/O completion
> using a signal or a thread, is it safe to release the lwlock from the
> signal handler or a separate thread?

In the forthcoming new version of the patch that uses sigevent,
the originator locks an LWLock associated with that BAaiocb exclusive,
and, when signalled, in the signal handler it places that LWLock
on a process-local queue of LWLocks awaiting release.
(No, it cannot be safely released inside the signal handler or in a
separate thread.)   Whenever the mainline passes a CHECK_FOR_INTERRUPTS macro,
and at a few additional points in bufmgr, the backend walks this process-local
queue and releases those LWLocks.   This is also done if the originator
itself issues a ReadBuffer, which is the most frequent case in which it
is released.

Meanwhile, any other backend will simply acquire that LWLock in shared mode
and then release it.
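In outline, the deferral mechanism looks like this (a simplified standalone
sketch -- the queue layout, lock_for_baiocb and the choice of SIGUSR2 are
illustrative, not the patch's exact code):

    /* Deferred LWLock release, much simplified.  LWLockRelease is not
     * async-signal-safe, so the handler only queues the lock; the
     * mainline drains the queue at CHECK_FOR_INTERRUPTS() and at a few
     * points in bufmgr.
     */
    #include <signal.h>

    typedef struct LWLock LWLock;                   /* opaque here */
    extern void    LWLockRelease(LWLock *lock);
    extern LWLock *lock_for_baiocb(void *baiocb);   /* hypothetical lookup */

    #define MAX_PENDING_RELEASES 64

    static LWLock *pending_release[MAX_PENDING_RELEASES];
    static volatile sig_atomic_t pending_count = 0;

    /* signal handler: runs on aio completion, enqueues only */
    static void
    aio_completion_handler(int signo, siginfo_t *info, void *ucontext)
    {
        if (pending_count < MAX_PENDING_RELEASES)
            pending_release[pending_count++] =
                lock_for_baiocb(info->si_value.sival_ptr);
    }

    /* mainline: called from CHECK_FOR_INTERRUPTS() and friends */
    static void
    drain_pending_lwlock_releases(void)
    {
        sigset_t    block, save;

        sigemptyset(&block);
        sigaddset(&block, SIGUSR2);             /* same illustrative signal */
        sigprocmask(SIG_BLOCK, &block, &save);  /* keep the handler out */

        while (pending_count > 0)
            LWLockRelease(pending_release[--pending_count]);

        sigprocmask(SIG_SETMASK, &save, NULL);
    }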

I think you are right that the existing io_in_progress_lock LWLock in the
buf_header could be used for this, because if there is an aio in progress,
then that lock cannot be in use for synchronous IO.  I chose not to use it
because I preferred to keep the wait/post for async IO separate,
but they could both use the same LWLock.   However, the way the LWLock
is acquired and released would still be a bit different, because of the need
to have the originator release it in its mainline.
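Concretely, reusing that lock would look roughly like this (an in-tree
sketch in terms of the existing LWLockAcquire/LWLockRelease calls;
StartBufferAio is a hypothetical stand-in for the issuing code):

    /* Sketch of the asymmetric discipline on io_in_progress_lock;
     * assumes the declarations from buf_internals.h and lwlock.h.
     */
    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/lwlock.h"

    extern void StartBufferAio(BufferDesc *buf);    /* hypothetical */

    static void
    originator_issue_aio(BufferDesc *buf)
    {
        /* taken before the aio_read is issued; deliberately NOT
         * released here -- the completion signal queues it and the
         * originator's mainline releases it later, as described above */
        LWLockAcquire(buf->io_in_progress_lock, LW_EXCLUSIVE);
        StartBufferAio(buf);
    }

    static void
    other_backend_wait_for_aio(BufferDesc *buf)
    {
        /* blocks until the originator's mainline has released its
         * exclusive hold, i.e. until the IO is complete */
        LWLockAcquire(buf->io_in_progress_lock, LW_SHARED);
        LWLockRelease(buf->io_in_progress_lock);
    }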

>
>> By the way, on the "will it actually work though?" question which several folks
>> have raised, I should mention that this patch has been in semi-production
>> use for almost 2 years now in different stages of completion on all postgresql
>> releases from 9.1.4 to 9.5 devel. I would guess it has had around
>> 500 hours of operation by now. I'm sure there are bugs still to be
>> found but I am confident it is fundamentally sound.
>
> Well, a committable version of this patch is going to look quite
> different from the first version that you posted, so I don't put much
> weight on how long you've tested the first version.

Yes, I am quite willing to change it, time permitting.
I take the words "committable version" as a positive sign ...


>
> - Heikki

