Thread: Re: AIO v2.3

Re: AIO v2.3

From: Thomas Munro
On Thu, Jan 23, 2025 at 5:29 PM Andres Freund <andres@anarazel.de> wrote:
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 3)

> - Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or
>  such. I don't love that name though.

Problem statement: You want to be able to batch I/O submission, ie
make a single call to io_uring_enter() (and other mechanisms) to start
several I/Os, but the code that submits is inside StartReadBuffers()
and the code that knows how many I/Os it wants to start now is at a
higher level, read_stream.c and in future elsewhere.  So you invented
this flag to tell StartReadBuffers() not to call
pgaio_submit_staged(), because you promise to do it later, via this
staging list.  Additionally, there is a kind of programming rule here
that you *must* submit I/Os that you stage, you aren't allowed to (for
example) stage I/Os and then sleep, so it has to be a fairly tight
piece of code.

Would the API be better like this?:  When you want to create a batch
of I/Os submitted together, you wrap the work in pgaio_begin_batch()
and pgaio_submit_batch(), eg the loop in read_stream_lookahead().
Then bufmgr wouldn't need this flag: when it (or anything else) calls
smgrstartreadv(), if there is not currently an explicit batch then it
would be submitted immediately, and otherwise it would only be staged.
This way, batch construction (or whatever word you prefer for batch)
is in a clearly and explicitly demarcated stretch of code in one
lexical scope (though its effect is dynamically scoped just like the
staging list itself because we don't want to pass explicit I/O
contexts through the layers), but code that doesn't call those and
reaches AsyncReadBuffer() or whatever gets an implicit batch of size
one and that's also OK.  Not sure what semantics nesting would have
but I doubt it matters much.

> Things that need to be fixed / are fixed in this:
> - max pinned buffers should be limited by io_combine_limit, not * 4
> - overflow distance
> - pins need to be limited in more places

I have patches for these and a few more things and will post in a
separate thread shortly because they can be understood without
reference to this AIO stuff and that'll hopefully be more digestible.

+    /*
+     * In some rare-ish cases one operation causes multiple reads (e.g. if a
+     * buffer was concurrently read by another backend). It'd be much better
+     * if we ensured that each ReadBuffersOperation covered only one IO - but
+     * that's not entirely trivial, due to having pinned victim buffers before
+     * starting IOs.
+     *
+     * TODO: Change the API of StartReadBuffers() to ensure we only ever need
+     * one IO.

Likewise.

+    /* IO finished, but result has not yet been processed */
+    PGAIO_HS_COMPLETED_IO,
+
+    /* IO completed, shared completion has been called */
+    PGAIO_HS_COMPLETED_SHARED,
+
+    /* IO completed, local completion has been called */
+    PGAIO_HS_COMPLETED_LOCAL,

(Repeating something I mentioned in off-list bikeshedding)  I wondered
if it might be clearer to use the terminology "terminated" for the
work that PostgreSQL has to do after an I/O completes, instead of
overloading/subdividing the term "completed".  We already "terminate"
an I/O when smgr I/O completes in pre-existing bufmgr terminology, and
this feels like a sort of generalisation of that notion.  In this AIO
world, some work is done by the backend that receives the completion
notification from the kernel, and some is done by the backend that
submitted the I/O in the first place, a division that doesn't exist
with simple synchronous system calls.  I wonder if it would be clearer
to use terms based on those two roles, rather than "shared" and
"local", leading to something like:

PGAIO_HS_COMPLETED,
PGAIO_HS_TERMINATED_BY_COMPLETER,
PGAIO_HS_TERMINATED_BY_SUBMITTER,



Re: AIO v2.3

From: Andres Freund
Hi,

On 2025-02-11 11:48:38 +1300, Thomas Munro wrote:
> On Thu, Jan 23, 2025 at 5:29 PM Andres Freund <andres@anarazel.de> wrote:
> > - Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or
> >  such. I don't love that name though.
> 
> Problem statement: You want to be able to batch I/O submission, ie
> make a single call to io_uring_enter() (and other mechanisms) to start
> several I/Os, but the code that submits is inside StartReadBuffers()
> and the code that knows how many I/Os it wants to start now is at a
> higher level, read_stream.c and in future elsewhere.  So you invented
> this flag to tell StartReadBuffers() not to call
> pgaio_submit_staged(), because you promise to do it later, via this
> staging list.  Additionally, there is a kind of programming rule here
> that you *must* submit I/Os that you stage, you aren't allowed to (for
> example) stage I/Os and then sleep, so it has to be a fairly tight
> piece of code.
> 
> Would the API be better like this?:  When you want to create a batch
> of I/Os submitted together, you wrap the work in pgaio_begin_batch()
> and pgaio_submit_batch(), eg the loop in read_stream_lookahead().
> Then bufmgr wouldn't need this flag: when it (or anything else) calls
> smgrstartreadv(), if there is not currently an explicit batch then it
> would be submitted immediately, and otherwise it would only be staged.
> This way, batch construction (or whatever word you prefer for batch)
> is in a clearly and explicitly demarcated stretch of code in one
> lexical scope (though its effect is dynamically scoped just like the
> staging list itself because we don't want to pass explicit I/O
> contexts through the layers), but code that doesn't call those and
> reaches AsyncReadBuffer() or whatever gets an implicit batch of size
> one and that's also OK.  Not sure what semantics nesting would have
> but I doubt it matters much.

I'm a bit unexcited about the work to redesign this, but I also admit that you
have a point :)

Linux calls a similar concept "plugging" the queue. I think I like "batch"
better, but only marginally.


> > Things that need to be fixed / are fixed in this:
> > - max pinned buffers should be limited by io_combine_limit, not * 4
> > - overflow distance
> > - pins need to be limited in more places
> 
> I have patches for these and a few more things and will post in a
> separate thread shortly because they can be understood without
> reference to this AIO stuff and that'll hopefully be more digestible.

Yay!


> +    /* IO finished, but result has not yet been processed */
> +    PGAIO_HS_COMPLETED_IO,
> +
> +    /* IO completed, shared completion has been called */
> +    PGAIO_HS_COMPLETED_SHARED,
> +
> +    /* IO completed, local completion has been called */
> +    PGAIO_HS_COMPLETED_LOCAL,
> 
> (Repeating something I mentioned in off-list bikeshedding)  I wondered
> if it might be clearer to use the terminology "terminated" for the
> work that PostgreSQL has to do after an I/O completes, instead of
> overloading/subdividing the term "completed".  We already "terminate"
> an I/O when smgr I/O completes in pre-existing bufmgr terminology, and
> this feels like a sort of generalisation of that notion.

I have a mild laziness preference for complete over terminate, but not
more. If others agree with Thomas, I'm ok with changing it.


> In this AIO world, some work is done by the backend that receives the
> completion notification from the kernel, and some is done by the backend
> that submitted the I/O in the first place, a division that doesn't exist
> with simple synchronous system calls.  I wonder if it would be clearer to
> use terms based on those two roles, rather than "shared" and "local",
> leading to something like:
> 
> PGAIO_HS_COMPLETED,
> PGAIO_HS_TERMINATED_BY_COMPLETER,
> PGAIO_HS_TERMINATED_BY_SUBMITTER,

I don't love those, because the SHARED/LOCAL does imply more clearly what you
have access to. I.e. executing things in a shared completion callback for IO
on a temporary buffer doesn't make sense, you won't have access to the local
buffer table.

Greetings,

Andres Freund



Re: AIO v2.3

From: James Hunter
On Mon, Feb 10, 2025 at 2:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
...
> Problem statement: You want to be able to batch I/O submission, ie
> make a single call to io_uring_enter() (and other mechanisms) to start
> several I/Os, but the code that submits is inside StartReadBuffers()
> and the code that knows how many I/Os it wants to start now is at a
> higher level, read_stream.c and in future elsewhere.  So you invented
> this flag to tell StartReadBuffers() not to call
> pgaio_submit_staged(), because you promise to do it later, via this
> staging list.  Additionally, there is a kind of programming rule here
> that you *must* submit I/Os that you stage, you aren't allowed to (for
> example) stage I/Os and then sleep, so it has to be a fairly tight
> piece of code.
>
> Would the API be better like this?:  When you want to create a batch
> of I/Os submitted together, you wrap the work in pgaio_begin_batch()
> and pgaio_submit_batch(), eg the loop in read_stream_lookahead().
> Then bufmgr wouldn't need this flag: when it (or anything else) calls
> smgrstartreadv(), if there is not currently an explicit batch then it
> would be submitted immediately, and otherwise it would only be staged.
> This way, batch construction (or whatever word you prefer for batch)
> is in a clearly and explicitly demarcated stretch of code in one
> lexical scope (though its effect is dynamically scoped just like the
> staging list itself because we don't want to pass explicit I/O
> contexts through the layers), but code that doesn't call those and
> reaches AsyncReadBuffer() or whatever gets an implicit batch of size
> one and that's also OK.  Not sure what semantics nesting would have
> but I doubt it matters much.

I like this idea. If we want to submit a batch, then just submit a batch.

James



Re: AIO v2.3

From: Robert Haas
On Tue, Feb 11, 2025 at 12:11 PM James Hunter <james.hunter.pg@gmail.com> wrote:
> I like this idea. If we want to submit a batch, then just submit a batch.

Sounds good to me, too.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: AIO v2.3

From: Andres Freund
Hi,

On 2025-02-11 11:48:38 +1300, Thomas Munro wrote:
> Would the API be better like this?:  When you want to create a batch
> of I/Os submitted together, you wrap the work in pgaio_begin_batch()
> and pgaio_submit_batch(), eg the loop in read_stream_lookahead().

One annoying detail is that an API like this would afaict need resowner
support or something along those lines (e.g. xact callbacks plus code in each
aux process' sigsetjmp() block). Otherwise I don't know how we would ensure
that the "batch-is-in-progress" flag/counter would get reset.

Alternatively we could make pgaio_batch_begin() basically start a critical
section, but that doesn't seem like a good idea, because too much that needs
to happen around buffered IO isn't compatible with critical sections.


Does anybody see a need for batches to be nested? I'm inclined to think that
that would be indicative of bugs and should therefore error/assert out.


One way we could avoid the need for a mechanism to reset-batch-in-progress
would be to make batch submission controlled by a flag on the IO. Something
like
    pgaio_io_set_flag(ioh, PGAIO_HF_BATCH_SUBMIT)

IFF PGAIO_HF_BATCH_SUBMIT is set, the IOs would need to be explicitly
submitted using something like the existing
    pgaio_submit_staged();
(although renaming it to something with batch in the name might be
appropriate)

That way there's no explicit "we are in a batch" state that needs to be reset
in case of errors.

Greetings,

Andres Freund



Re: AIO v2.3

From: Robert Haas
On Tue, Feb 11, 2025 at 4:43 PM Andres Freund <andres@anarazel.de> wrote:
> Alternatively we could make pgaio_batch_begin() basically start a critical
> section, but that doesn't seem like a good idea, because too much that needs
> to happen around buffered IO isn't compatible with critical sections.

A critical section sounds like a bad plan.

> Does anybody see a need for batches to be nested? I'm inclined to think that
> that would be indicative of bugs and should therefore error/assert out.

I can imagine somebody wanting to do it, but I think we can just say
no. I mean, it's no different from WAL record construction. There's no
theoretical reason you couldn't want to concurrently construct
multiple WAL records and then submit them one after another, but if
you want to do that, you have to do your own bookkeeping. It seems
fine to apply the same principle here.

> One way we could avoid the need for a mechanism to reset-batch-in-progress
> would be to make batch submission controlled by a flag on the IO. Something
> like
>     pgaio_io_set_flag(ioh, PGAIO_HF_BATCH_SUBMIT)
>
> IFF PGAIO_HF_BATCH_SUBMIT is set, the IOs would need to be explicitly
> submitted using something like the existing
>     pgaio_submit_staged();
> (although renaming it to something with batch in the name might be
> appropriate)
>
> That way there's no explicit "we are in a batch" state that needs to be reset
> in case of errors.

I'll defer to Thomas or others on whether this is better or worse,
because I don't know. It means that the individual I/Os have to know
that they are in a batch, which isn't necessary with the begin/end
batch interface. But if we're expecting that to happen in a pretty
confined amount of code -- similar to WAL record construction -- then
that might not be a problem anyway.

I think if you don't do this, I'd do (sub)xact callbacks rather than a
resowner integration, unless you decide you want to support multiple
concurrent batches. You don't really need or want to tie it to a
resowner unless there are multiple objects each of which can have its
own resources.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: AIO v2.3

From: James Hunter
On Tue, Feb 11, 2025 at 1:44 PM Andres Freund <andres@anarazel.de> wrote:
>
...

> Alternatively we could make pgaio_batch_begin() basically start a critical
> section, but that doesn't seem like a good idea, because too much that needs
> to happen around buffered IO isn't compatible with critical sections.
>
>
> Does anybody see a need for batches to be nested? I'm inclined to think that
> that would be indicative of bugs and should therefore error/assert out.

Fwiw, in a similar situation in the past, I just blocked, waiting for
the in-flight batch to complete, before sending the next batch. So I
had something like:

void begin_batch(...)
{
  if (batch_in_progress())
    complete_batch(...);

  /* ok start the batch now */
}

In my case, batches were nested because different access methods
(e.g., Index) can call/trigger other access methods (Heap), and both
access methods might want to issue batch reads. However, the "inner"
access method might not be aware of the "outer" access method.

For simplicity, then, I just completed the outer batch. Note that this
is not optimal for performance (because a nested batch ends up
stalling the outer batch), but it does keep the code simple...

James



Re: AIO v2.3

From: Andres Freund
Hi,

On 2025-02-12 13:00:22 -0500, Robert Haas wrote:
> On Tue, Feb 11, 2025 at 4:43 PM Andres Freund <andres@anarazel.de> wrote:
> > One way we could avoid the need for a mechanism to reset-batch-in-progress
> > would be to make batch submission controlled by a flag on the IO. Something
> > like
> >     pgaio_io_set_flag(ioh, PGAIO_HF_BATCH_SUBMIT)
> >
> > IFF PGAIO_HF_BATCH_SUBMIT is set, the IOs would need to be explicitly
> > submitted using something like the existing
> >     pgaio_submit_staged();
> > (although renaming it to something with batch in the name might be
> > appropriate)
> >
> > That way there's no explicit "we are in a batch" state that needs to be reset
> > in case of errors.
> 
> I'll defer to Thomas or others on whether this is better or worse,
> because I don't know. It means that the individual I/Os have to know
> that they are in a batch, which isn't necessary with the begin/end
> batch interface. But if we're expecting that to happen in a pretty
> confined amount of code -- similar to WAL record construction -- then
> that might not be a problem anyway.
> 
> I think if you don't do this, I'd do (sub)xact callbacks rather than a
> resowner integration, unless you decide you want to support multiple
> concurrent batches. You don't really need or want to tie it to a
> resowner unless there are multiple objects each of which can have its
> own resources.

I have working code for that. There unfortunately is an annoying problem:

It afaict is not really possible to trigger a WARNING/ERROR/Assert in xact
callbacks at the end of commands in an explicit transaction.  Consider
something like:

  BEGIN;
  SELECT start_but_not_end_aio_batch();

we would like to flag that the SELECT query started an AIO batch but didn't
end it. But at the moment there, afaict, is no proper way to do that, because
xact.c won't actually do much at the end of a command in a transaction block:

            /*
             * This is the case when we have finished executing a command
             * someplace within a transaction block.  We increment the command
             * counter and return.
             */
        case TBLOCK_INPROGRESS:
        case TBLOCK_IMPLICIT_INPROGRESS:
        case TBLOCK_SUBINPROGRESS:
            CommandCounterIncrement();
            break;

If one instead integrates with resowners, that kind of thing works, because
exec_simple_query() calls PortalDrop(), which in turn calls
ResourceOwnerRelease().


And we don't reliably warn at a later time. While xact.c integration triggers
a warning for:

  BEGIN;
  SELECT start_but_not_end_aio_batch();
  COMMIT;

as we'd still have the batch open at COMMIT, it wouldn't trigger for

  BEGIN;
  SELECT start_but_not_end_aio_batch();
  ROLLBACK;

as AbortTransaction() might be called in an error and therefore can't assume
that it's a problem for a batch to still be open.


I guess I could just put something alongside that CommandCounterIncrement()
call, but that doesn't seem right.  I guess putting it alongside the
ResourceOwnerRelease() in PortalDrop() is a bit less bad? But still doesn't
seem great.


Just using resowners doesn't seem right either, it's not really free to
register something with resowners, and for read intensive IO we can start a
*lot* of batches, so doing unnecessary work isn't great.

Greetings,

Andres Freund



Re: AIO v2.3

From: Robert Haas
On Mon, Feb 17, 2025 at 4:01 PM Andres Freund <andres@anarazel.de> wrote:
> If one instead integrates with resowners, that kind of thing works, because
> exec_simple_query() calls PortalDrop(), which in turn calls
> ResourceOwnerRelease().

Hmm, so maybe that's a reason to do it via resowner.c, then. The fact
that it's a singleton object is a bit annoying, but you could make it
not a singleton, and then either pass the relevant one to the
interface functions, or store the current one in a global variable
similar to CurrentMemoryContext or similar.

> I guess I could just put something alongside that CommandCounterIncrement()
> call, but that doesn't seem right.  I guess putting it alongside the
> ResourceOwnerRelease() in PortalDrop() is a bit less bad? But still doesn't
> seem great.

The thing that's weird about that is that it isn't really logically
linked to the portal. It feels like it more properly belongs in
StartTransactionCommand() / CommitTransactionCommand().

> Just using resowners doesn't seem right either, it's not really free to
> register something with resowners, and for read intensive IO we can start a
> *lot* of batches, so doing unnecessary work isn't great.

You don't necessarily have to register a new object for every batch,
do you? You could just register one and keep reusing it for the
lifetime of the query.

--
Robert Haas
EDB: http://www.enterprisedb.com