Thread: Re: AIO v2.0

Re: AIO v2.0

From
Heikki Linnakangas
Date:
On 01/09/2024 09:27, Andres Freund wrote:
> The main reason I had previously implemented WAL AIO etc was to know the
> design implications - but now that they're somewhat understood, I'm planning
> to keep the patchset much smaller, with the goal of making it upstreamable.

+1 on that approach.

> To solve the issue with an unbounded number of AIO references there are few
> changes compared to the prior approach:
> 
> 1) Only one AIO handle can be "handed out" to a backend, without being
>     defined. Previously the process of getting an AIO handle wasn't super
>     lightweight, which made it appealing to cache AIO handles - which was one
>     part of the problem for running out of AIO handles.
> 
> 2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
>     valid operation) to stay around, it's always possible to execute the AIO
>     operation and then reuse the handle.  This provides a forward guarantee, by
>     ensuring that completing AIOs can free up handles (previously they couldn't
>     be reused until the backend local reference was released).
> 
> 3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
>     take the server down.
> 
> 4) Obviously some code needs to know the result of AIO operation and be able
>     to error out. To allow for that the issuer of an AIO can provide a pointer
>     to local memory that'll receive the result of an AIO, including details
>     about what kind of errors occurred (possible errors are e.g. a read failing
>     or a buffer's checksum validation failing).
> 
> 
> In the next few days I'll add a bunch more documentation and comments as well
> as some better perf numbers (assuming my workstation survived...).

Yeah, a high-level README would be nice. Without that, it's hard to 
follow what "handed out" and "defined" above means for example.

A few quick comments the patches:

v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch

+1, this seems ready to be committed right away.

v2.0-0002-Allow-lwlocks-to-be-unowned.patch

With LOCK_DEBUG, LWLock->owner will point to the backend that acquired 
the lock, but it doesn't own it anymore. That's reasonable, but maybe 
add a boolean to the LWLock to mark whether the lock is currently owned 
or not.

The LWLockReleaseOwnership() name is a bit confusing together with 
LWLockReleaseUnowned() and LWLockrelease(). From the names, you might 
think that they all release the lock, but LWLockReleaseOwnership() just 
disassociates it from the current process. Rename it to LWLockDisown() 
perhaps.

v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch

+1. The old comment "We don't currently need any ResourceOwner in a 
walsender process" was a bit misleading, because the walsender did 
create the short-lived "base backup" resource owner, so it's nice to get 
that fixed.

v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch

My refactoring around postmaster.c child process handling will conflict 
with this [1]. Not in any fundamental way, but can I ask you to review 
those patch, please? After those patches, AIO workers should also have 
PMChild slots (formerly known as Backend structs).

[1] 
https://www.postgresql.org/message-id/a102f15f-eac4-4ff2-af02-f9ff209ec66f@iki.fi

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: AIO v2.0

From
Andres Freund
Date:
Hi,

On 2024-09-02 13:03:07 +0300, Heikki Linnakangas wrote:
> On 01/09/2024 09:27, Andres Freund wrote:
> > In the next few days I'll add a bunch more documentation and comments as well
> > as some better perf numbers (assuming my workstation survived...).
> 
> Yeah, a high-level README would be nice. Without that, it's hard to follow
> what "handed out" and "defined" above means for example.

Yea - I had actually written a bunch of that before, but then redesigns just
obsoleted most of it :(

FWIW, "handed out" is an IO handle acquired by code, which doesn't yet have an
operation associated with it. Once "defined" it actually could be - but isn't
yet - executed.


> A few quick comments the patches:
> 
> v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch
> 
> +1, this seems ready to be committed right away.

Cool


> v2.0-0002-Allow-lwlocks-to-be-unowned.patch
> 
> With LOCK_DEBUG, LWLock->owner will point to the backend that acquired the
> lock, but it doesn't own it anymore. That's reasonable, but maybe add a
> boolean to the LWLock to mark whether the lock is currently owned or not.

Hm, not sure it's worth doing that...


> The LWLockReleaseOwnership() name is a bit confusing together with
> LWLockReleaseUnowned() and LWLockrelease(). From the names, you might think
> that they all release the lock, but LWLockReleaseOwnership() just
> disassociates it from the current process. Rename it to LWLockDisown()
> perhaps.

Yea, that makes sense.


> v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch
> 
> My refactoring around postmaster.c child process handling will conflict with
> this [1]. Not in any fundamental way, but can I ask you to review those
> patch, please? After those patches, AIO workers should also have PMChild
> slots (formerly known as Backend structs).

I'll try to do that soonish!

Greetings,

Andres Freund



Re: AIO v2.0

From
陈宗志
Date:
I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.
For example, which paths were modified in the AIO module? Is it the
path for writing WAL logs, or the path for flushing pages, etc.?

Also, I recommend keeping this patch as small as possible.
For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.



Re: AIO v2.0

From
Andres Freund
Date:
Hi,

On 2024-09-05 01:37:34 +0800, 陈宗志 wrote:
> I hope there can be a high-level design document that includes a
> description, high-level architecture, and low-level design.
> This way, others can also participate in reviewing the code.

Yep, that was already on my todo list. The version I just posted includes
that.


> For example, which paths were modified in the AIO module?
> Is it the path for writing WAL logs, or the path for flushing pages, etc.?

I don't think it's good to document this in a design document - that's just
bound to get out of date.

For now the patchset causes AIO to be used for

1) all users of read_stream.h, e.g. sequential scans

2) bgwriter / checkpointer, mainly to have way to exercise the write path. As
   mentioned in my email upthread, the code for that is in a somewhat rough
   shape as Thomas Munro is working on a more general abstraction for some of
   this.

The earlier patchset added a lot more AIO uses because I needed to know all
the design constraints. It e.g. added AIO use in WAL. While that allowed me to
learn a lot, it's not something that makes sense to continue working on for
now, as it requires a lot of work that's independent of AIO.  Thus I am
focusing on the above users for now.


> Also, I recommend keeping this patch as small as possible.

Yep. That's my goal (as mentioned upthread).


> For example, the first step could be to introduce libaio only, without
> considering io_uring, as that would make it too complex.

Currently the patchset doesn't contain libaio support and I am not planning to
work on using libaio. Nor do I think it makes sense for anybody else to do so
- libaio doesn't work for buffered IO, making it imo not particularly useful
for us.

The io_uring specific code isn't particularly complex / large compared to the
main AIO infrastructure.

Greetings,

Andres Freund



Re: AIO v2.0

From
Andres Freund
Date:
Hi,

On 2024-09-12 14:55:49 -0700, Robert Pang wrote:
> Hi Andres
> 
> Thanks for the AIO patch update. I gave it a try and ran into a FATAL
> in bgwriter when executing a benchmark.
> 
> 2024-09-12 01:38:00.851 PDT [2780939] PANIC:  no more bbs
> 2024-09-12 01:38:00.854 PDT [2780473] LOG:  background writer process
> (PID 2780939) was terminated by signal 6: Aborted
> 2024-09-12 01:38:00.854 PDT [2780473] LOG:  terminating any other
> active server processes
> 
> I debugged a bit and found that BgBufferSync() is not capping the
> batch size under io_bounce_buffers like BufferSync() for checkpoint.
> Here is a small patch to fix it.

Good catch, thanks!


I am hoping (as described in my email to Noah a few minutes ago) that we can
get away from needing bounce buffers. They are a quite expensive solution to a
problem we made for ourselves...

Greetings,

Andres Freund