Thread: Managing IO workers in postmaster's state machine

From: Andres Freund
Hi,

For AIO, when using the worker process based "AIO", we need to manage a set of
workers. These need to be started by postmaster, obviously.

I have a few questions about the design details. They seemed mostly
independent of the larger AIO thread, so I started this separately.


1) Where to start/stop workers after a config reload?

The GUC that controls how many workers are allowed is PGC_SIGHUP. Therefore we
need to adjust the number of workers after a config reload.

Right now this is implemented in a GUC assign hook (I think by Thomas, but it
could have been me). But that seems rather iffy.

For example, I think this means that, as we're in the middle of the GUC reload
process, a child forked in that moment may have some updated GUCs, but not
others.  But even besides that, it "feels wrong", in a way I can't fully
articulate.

However, it's not exactly obvious *where* workers should be started / stopped
instead.

One approach would be to put the adjustment in PostmasterStateMachine(), but
currently a config reload request doesn't trigger an invocation of
PostmasterStateMachine().

We have a somewhat similar case for bgworkers, for those we have a dedicated
call in ServerLoop().  I guess we could just copy that, but it doesn't seem
exactly pretty.


2) New state machine phase for AIO shutdown?

During non-crash shutdown, we currently stop archiver and walsender last
(except dead-end children, but those don't matter in this context). But at
least walsender might use AIO, so we need to shut down AIO after walsender.

We already have PM_SHUTDOWN, PM_SHUTDOWN_2. I didn't want to introduce
PM_SHUTDOWN_3. For now there's a new PM_SHUTDOWN_IO.

Is that the right approach?

But somehow the number of shutdown phases seems to be getting out of control,
besides PM_SHUTDOWN* we also have PM_STOP_BACKENDS, PM_WAIT_BACKENDS,
PM_WAIT_DEAD_END and PM_NO_CHILDREN...


3) Do we need a new PM* phase at startup?

Because the startup process - obviously - can perform IO, IO workers need to
be started before the startup process is started.

We already do this with checkpointer and bgwriter. For those we don't have a
new phase, we just do so before forking the startup process. That means that
the startup process might need to wait for checkpointer to finish starting up,
but that seems fine to me.  The same seems true for IO workers.

Therefore I don't see a need for a new phase?


4) Resetting pmState during crash-restart?

For the IO worker handling we need to check the phase we currently are in to
know whether to start new workers (so we don't start more workers while
shutting down).

That's easy during normal startup, the state is PM_INIT. But for a
crash-restart (bottom of PostmasterStateMachine()), pmState will be
PM_NO_CHILDREN - in which we would *NOT* want to start new IO workers normally
(if e.g. a SIGHUP is processed while in that state). But we do need to start
IO workers at that point in the crash-restart cycle.

The AIO patchset has dealt with that by moving pmState = PM_STARTUP to a bit
earlier. But that seems a *bit* weird, even though I don't see any negative
consequences right now.

Would it be better to reset pmState to PM_INIT? That doesn't seem quite right
either, it's described as "postmaster starting".


5) btmask_all_except2() not used anymore, btmask_all_except3() needed

The one place using btmask_all_except2() now also needs to allow IO workers,
and thus btmask_all_except3(). Which triggers warnings about
btmask_all_except2 being unused.

We could remove btmask_all_except2(), ifdef it out or create a more generic
version that works with arbitrary numbers of arguments.

Something like

static inline BackendTypeMask
btmask_all_except_n(int nargs, BackendType *t)
{
    BackendTypeMask mask = BTYPE_MASK_ALL;

    for (int i = 0; i < nargs; i++)
        mask = btmask_del(mask, t[i]);
    return mask;
}

#define btmask_all_except(...) \
    btmask_all_except_n( \
        lengthof(((BackendType[]){__VA_ARGS__})), \
        (BackendType[]){__VA_ARGS__} \
    )

should do the trick, I think.


6) Variable numbered aux processes

The IO workers are aux processes - that seemed to be the most
appropriate. However, if "real" AIO is configured (PGC_POSTMASTER), no IO
workers can be started.

The only change the patch had to make for that is:

-#define NUM_AUXILIARY_PROCS     6
+#define MAX_IO_WORKERS          32
+#define NUM_AUXILIARY_PROCS     (6 + MAX_IO_WORKERS)

Which means that we'll reserve some resources for IO workers even if no IO
workers can be started. Do we consider that a problem?  There's some precedent
- we reserve space for archiver even if not configured, despite archive_mode
being PGC_POSTMASTER - but of course that's a different scale.

I guess this could be argued to better be in a separate thread...

Greetings,

Andres Freund



Re: Managing IO workers in postmaster's state machine

From: Cédric Villemain
Hi Andres,

Thanks for sharing these detailed questions and insights. Below are some 
comments and additional thoughts that might help, particularly regarding 
bgworkers.


On 09/12/2024 22:42, Andres Freund wrote:

> Hi,
>
> For AIO, when using the worker process based "AIO", we need to manage a set of
> workers. These need to be started by postmaster, obviously.
>
> I have a few questions about the design details. They seemed mostly
> independent of the larger AIO thread, so I started this separately.
>
>
> 1) Where to start/stop workers after a config reload?
>
> The GUC that controls how many workers are allowed is PGC_SIGHUP. Therefore we
> need to adjust the number of workers after a config reload.
>
> Right now this is implemented in a GUC assign hook (I think by Thomas, but it
> could have been me). But that seems rather iffy.
>
> For example, I think this means that, as we're in the middle of the GUC reload
> process, a child forked in that moment may have some updated GUCs, but not
> others.  But even besides that, it "feels wrong", in a way I can't fully
> articulate.
>
> However, it's not exactly obvious *where* workers should be started / stopped
> instead.
>
> One approach would be to put the adjustment in PostmasterStateMachine(), but
> currently a config reload request doesn't trigger an invocation of
> PostmasterStateMachine().
>
> We have a somewhat similar case for bgworkers, for those we have a dedicated
> call in ServerLoop().  I guess we could just copy that, but it doesn't seem
> exactly pretty.


I am unsure why background workers are managed directly in ServerLoop();
wouldn't it help to have a dedicated "background worker launcher" process
instead?

> 2) New state machine phase for AIO shutdown?
>
> During non-crash shutdown, we currently stop archiver and walsender last
> (except dead-end children, but those don't matter in this context). But at
> least walsender might use AIO, so we need to shut down AIO after walsender.
>
> We already have PM_SHUTDOWN, PM_SHUTDOWN_2. I didn't want to introduce
> PM_SHUTDOWN_3. For now there's a new PM_SHUTDOWN_IO.
>
> Is that the right approach?
>
> But somehow the number of shutdown phases seems to be getting out of control,
> besides PM_SHUTDOWN* we also have PM_STOP_BACKENDS, PM_WAIT_BACKENDS,
> PM_WAIT_DEAD_END and PM_NO_CHILDREN...


For startup there is a StartupStatusEnum; it serves a distinct purpose, but
maybe a ShutdownSequence/ShutdownStatus would be helpful, just to outline the
order of operations for the shutdown states? That would be closer to a chain
of responsibility than to the existing state machine...

> 3) Do we need a new PM* phase at startup?
>
> Because the startup process - obviously - can perform IO, IO workers need to
> be started before the startup process is started.
>
> We already do this with checkpointer and bgwriter. For those we don't have a
> new phase, we just do so before forking the startup process. That means that
> the startup process might need to wait for checkpointer to finish starting up,
> but that seems fine to me.  The same seems true for IO workers.
>
> Therefore I don't see a need for a new phase?

So it would belong in PM_INIT? But then there's the next question:

> 4) Resetting pmState during crash-restart?
>
> For the IO worker handling we need to check the phase we currently are in to
> know whether to start new workers (so we don't start more workers while
> shutting down).
>
> That's easy during normal startup, the state is PM_INIT. But for a
> crash-restart (bottom of PostmasterStateMachine()), pmState will be
> PM_NO_CHILDREN - in which we would *NOT* want to start new IO workers normally
> (if e.g. a SIGHUP is processed while in that state). But we do need to start
> IO workers at that point in the crash-restart cycle.
>
> The AIO patchset has dealt with that by moving pmState = PM_STARTUP to a bit
> earlier. But that seems a *bit* weird, even though I don't see any negative
> consequences right now.
>
> Would it be better to reset pmState to PM_INIT? That doesn't seem quite right
> either, it's described as "postmaster starting".

The comments in postmaster.c say that crash recovery is like a shutdown
followed by a startup (not by an init).
If I understood correctly, IO workers will need to be restarted in that
context, so they must be started after PM_INIT ...


> 5) btmask_all_except2() not used anymore, btmask_all_except3() needed
>
> The one place using btmask_all_except2() now also needs to allow IO workers,
> and thus btmask_all_except3(). Which triggers warnings about
> btmask_all_except2 being unused.
>
> We could remove btmask_all_except2(), ifdef it out or create a more generic
> version that works with arbitrary numbers of arguments.
>
> Something like
>
> static inline BackendTypeMask
> btmask_all_except_n(int nargs, BackendType *t)
> {
>      BackendTypeMask mask = BTYPE_MASK_ALL;
>
>      for (int i = 0; i < nargs; i++)
>          mask = btmask_del(mask, t[i]);
>      return mask;
> }
>
> #define btmask_all_except(...) \
>      btmask_all_except_n( \
>          lengthof(((BackendType[]){__VA_ARGS__})), \
>          (BackendType[]){__VA_ARGS__} \
>      )
>
> should do the trick, I think.
>
>
> 6) Variable numbered aux processes
>
> The IO workers are aux processes - that seemed to be the most
> appropriate. However, if "real" AIO is configured (PGC_POSTMASTER), no IO
> workers can be started.
>
> The only change the patch had to make for that is:
>
> -#define NUM_AUXILIARY_PROCS     6
> +#define MAX_IO_WORKERS          32
> +#define NUM_AUXILIARY_PROCS     (6 + MAX_IO_WORKERS)
>
> Which means that we'll reserve some resources for IO workers even if no IO
> workers can be started. Do we consider that a problem?  There's some precedent
> - we reserve space for archiver even if not configured, despite archive_mode
> being PGC_POSTMASTER - but of course that's a different scale.


I have not checked the history, and I have read some details on this topic,
but it is still not clear to me **why** this is hardcoded.  If there's a
technical or historical reason for the current design, it would be useful to
revisit it in this context.
What prevents the number of such worker processes from being sized
dynamically today?

---
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D