Re: Automatically sizing the IO worker pool - Mailing list pgsql-hackers

| From | Andres Freund |
|---|---|
| Subject | Re: Automatically sizing the IO worker pool |
| Msg-id | jp7lhn2wng5mlhn3jvn4t6nmr23l6bnim5af3qslnoguevfqqz@d7cvksm62j4j |
| In response to | Re: Automatically sizing the IO worker pool (Thomas Munro <thomas.munro@gmail.com>) |
| List | pgsql-hackers |
Hi,
On 2026-04-07 03:02:52 +1200, Thomas Munro wrote:
> Here's an updated patch. It's mostly just rebased over the recent
> firehose, but with lots of comments and a few names (hopefully)
> improved. There is one code change to highlight though:
>
> maybe_start_io_workers() knows when it's not allowed to create new
> workers, an interesting case being FatalError before we have started
> the new world.
*worker, I assume?
> The previous coding of DetermineSleepTime() didn't
> know about that, so it could return 0 (don't sleep), and then the
> postmaster could busy-wait for restart progress.
In master or the prior version of your patch?
> Maybe there were
> other cases like that, but in general DetermineSleepTime() and
> maybe_start_io_workers() really need to be 100% in agreement. So I
> have moved that knowledge into a new function
> maybe_start_io_workers_scheduled_at(). Both DetermineSleepTime() and
> maybe_start_io_workers() call that so there is a single source of
> truth.
>
> I think I got confused about that because it's not that obvious why
> the existing code doesn't test FatalError.
>
> I thought of a slightly bigger refactoring that might deconfuse
> DetermineSleepTime() a bit more. Probably material for the next
> cycle, but basically the idea is to stop using a bunch of different
> conditions and different units of time and convert the whole thing to
> a simple find-the-lowest-time function. I kept that separate.
>
> I'll post a new version of the patch that was v3-0002 separately.
> From 6c5d16a15add62c68bb7f9c7b6a1e3bde1f406d8 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Sat, 22 Mar 2025 00:36:49 +1300
> Subject: [PATCH v4 1/2] aio: Adjust I/O worker pool size automatically.
>
> The size of the I/O worker pool used to implement io_method=worker was
> previously controlled by the io_workers setting, defaulting to 3. It
> was hard to know how to tune it effectively. It is now replaced with:
>
> io_min_workers=1
> io_max_workers=8 (up to 32)
> io_worker_idle_timeout=60s
> io_worker_launch_interval=100ms
I'm a bit concerned about defaulting to io_min_workers=1. That means that in an
intermittent workload, there will be no IO concurrency for short-running but
IO-intensive queries, while still paying the dispatch overhead to the worker. It
can still be a win if the query is CPU intensive, but far from all are.
I'd therefore argue that the minimum ought to be at least 2.
> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
> index 6f13e8f40a0..c42564500c6 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -1555,14 +1558,13 @@ checkControlFile(void)
> static int
> DetermineSleepTime(void)
> {
> - TimestampTz next_wakeup = 0;
> + TimestampTz next_wakeup;
>
> /*
> - * Normal case: either there are no background workers at all, or we're in
> - * a shutdown sequence (during which we ignore bgworkers altogether).
> + * If an ImmediateShutdown or a crash restart has set a SIGKILL timeout,
> + * ignore everything else and wait for that.
> */
> - if (Shutdown > NoShutdown ||
> - (!StartWorkerNeeded && !HaveCrashedWorker))
> + if (Shutdown >= ImmediateShutdown || FatalError)
> {
> if (AbortStartTime != 0)
> {
> @@ -1582,14 +1584,16 @@ DetermineSleepTime(void)
>
> return seconds * 1000;
> }
> - else
> - return 60 * 1000;
> }
>
> - if (StartWorkerNeeded)
> + /* Time of next maybe_start_io_workers() call, or 0 for none. */
> + next_wakeup = maybe_start_io_workers_scheduled_at();
> +
> + /* Ignore bgworkers during shutdown. */
> + if (StartWorkerNeeded && Shutdown == NoShutdown)
> return 0;
Why is the maybe_start_io_workers_scheduled_at() thing before the return 0
here?
> - if (HaveCrashedWorker)
> + if (HaveCrashedWorker && Shutdown == NoShutdown)
> {
> dlist_mutable_iter iter;
>
> @@ -3797,6 +3811,15 @@ process_pm_pmsignal(void)
> StartWorkerNeeded = true;
> }
>
> + /* Process IO worker start requests. */
> + if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
> + {
> + /*
> + * No local flag, as the state is exposed through pgaio_worker_*()
> + * functions. This signal is received on potentially actionable level
> + * changes, so that maybe_start_io_workers() will run.
> + */
> + }
> /* Process background worker state changes. */
> if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
> {
Absolute nitpick - the different blocks so far have been separated by an empty
line.
> + /* Only proceed if a "grow" request is pending from existing workers. */
> + if (!pgaio_worker_test_grow())
> + return 0;
So this accesses shared memory from the postmaster. I think this amount of access
is safe enough that that's ok. You'd have to somehow have corrupted
postmaster's copy of io_worker_control, or unmapped the shared memory it
points to, for that to cause a crash. The first shouldn't be an issue, the
latter would be quite the confusion of the state machine.
> +/*
> + * Start I/O workers if required. Used at startup, to respond to change of
> + * the io_min_workers GUC, when asked to start a new one due to submission
> + * queue backlog, and after workers terminate in response to errors (by
> + * starting "replacement" workers).
> + */
> +static void
> +maybe_start_io_workers(void)
> +{
> + TimestampTz scheduled_at;
>
> - /* Not enough running? */
> - while (io_worker_count < io_workers)
> + while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
> {
> + TimestampTz now = GetCurrentTimestamp();
> PMChild *child;
> int i;
>
> + Assert(pmState < PM_WAIT_IO_WORKERS);
> +
> + /* Still waiting for the scheduled time? */
> + if (scheduled_at > now)
> + break;
> +
> + /* Clear the grow request flag if it is set. */
> + pgaio_worker_clear_grow();
> +
> + /*
> + * Compute next launch time relative to the previous value, so that
> + * time spent on the postmaster's other duties don't result in an
> + * inaccurate launch interval.
> + */
> + io_worker_launch_next_time =
> + TimestampTzPlusMilliseconds(io_worker_launch_next_time,
> + io_worker_launch_interval);
> +
> + /*
> + * If that's already in the past, the interval is either impossibly
> + * short or we received no requests for new workers for a period.
> + * Compute a new future time relative to the last launch time instead.
> + */
> + if (io_worker_launch_next_time <= now)
> + io_worker_launch_next_time =
> + TimestampTzPlusMilliseconds(io_worker_launch_last_time,
> + io_worker_launch_interval);
Did you intend to use TimestampTzPlusMilliseconds(now, ...) here? Or did you
want to have this if after the next line:
> + io_worker_launch_last_time = now;
> +
Because otherwise I don't understand how this is intended to work.
> /* find unused entry in io_worker_children array */
> for (i = 0; i < MAX_IO_WORKERS; ++i)
> {
> @@ -4454,20 +4539,14 @@ maybe_adjust_io_workers(void)
> ++io_worker_count;
> }
> else
> - break; /* try again next time */
> - }
> -
> - /* Too many running? */
> - if (io_worker_count > io_workers)
> - {
> - /* ask the IO worker in the highest slot to exit */
> - for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
> {
> - if (io_worker_children[i] != NULL)
> - {
> - kill(io_worker_children[i]->pid, SIGUSR2);
> - break;
> - }
> + /*
> + * Fork failure: we'll try again after the launch interval
> + * expires, or be called again without delay if we don't yet have
> + * io_min_workers. Don't loop here though, the postmaster has
> + * other duties.
> + */
> + break;
> }
> }
> }
Reading just this part of the diff I am wondering what is responsible for
reducing the number of workers below the max after a config change. I assume
it's done in the workers, but it might be worth putting a comment here noting
that.
> +/* Debugging support: show current IO and wakeups:ios statistics in ps. */
> +/* #define PGAIO_WORKER_SHOW_PS_INFO */
>
> typedef struct PgAioWorkerSubmissionQueue
> {
> @@ -63,13 +67,34 @@ typedef struct PgAioWorkerSubmissionQueue
>
> typedef struct PgAioWorkerSlot
> {
> - Latch *latch;
> - bool in_use;
> + ProcNumber proc_number;
> } PgAioWorkerSlot;
>
> +/*
> + * Sets of worker IDs are held in a simple bitmap, accessed through functions
> + * that provide a more readable abstraction. If we wanted to support more
> + * workers than that, the contention on the single queue would surely get too
> + * high, so we might want to consider multiple pools instead of widening this.
> + */
> +typedef uint64 PgAioWorkerSet;
> +#define PGAIO_WORKER_SET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
> +
> +static_assert(PGAIO_WORKER_SET_BITS >= MAX_IO_WORKERS, "too small");
> +
> typedef struct PgAioWorkerControl
> {
> - uint64 idle_worker_mask;
> + /* Seen by postmaster */
> + volatile bool grow;
What's that volatile intending to do here? It avoids the need for some
compiler barriers, but it's not clear to me those would be needed here anyway.
And it doesn't imply memory ordering, which I'm not sure is entirely wise
here. I'd probably just plop a full memory barrier in the few relevant
places, easier to reason about that way, and it can't matter given the
infrequency of access. I'd say we should just use a proper atomic, but right
now I don't think we do that in postmaster.
> + /* Protected by AioWorkerSubmissionQueueLock. */
> + PgAioWorkerSet idle_worker_set;
> +
> + /* Protected by AioWorkerControlLock. */
> + PgAioWorkerSet worker_set;
> + int nworkers;
> +
> + /* Protected by AioWorkerControlLock. */
> PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
> } PgAioWorkerControl;
>
> @@ -91,15 +116,103 @@ const IoMethodOps pgaio_worker_ops = {
>
>
> +static bool
> +pgaio_worker_set_is_empty(PgAioWorkerSet *set)
> +{
> + return *set == 0;
> +}
> +
> +static PgAioWorkerSet
> +pgaio_worker_set_singleton(int worker)
> +{
> + return UINT64_C(1) << worker;
> +}
I guess an assert about `worker` being small enough wouldn't hurt.
> +static void
> +pgaio_worker_set_fill(PgAioWorkerSet *set)
> +{
> + *set = UINT64_MAX >> (PGAIO_WORKER_SET_BITS - MAX_IO_WORKERS);
> +}
What does "_fill" really mean? Just that all valid bits are set? Why wouldn't
it be _all() or _full()?
> +static int
> +pgaio_worker_set_get_highest(PgAioWorkerSet *set)
> +{
> + Assert(!pgaio_worker_set_is_empty(set));
> + return pg_leftmost_one_pos64(*set);
> +}
"worker_set_get*" reads quite awkwardly. Maybe just going for
pgaio_workerset_* would help?
Or maybe just name it PgAioWset/pgaio_wset_ or such?
> +static void
> +pgaio_worker_grow(bool grow)
> +{
> + /*
> + * This is called from sites that don't hold AioWorkerControlLock, but
> + * these values change infrequently and an up-to-date value is not
> + * required for this heuristic purpose.
> + */
Is it actually useful to do this while not holding the control lock? Ah, I
see, this is due to the split of submission and control lock.
> + if (!grow)
> + {
> + /* Avoid dirtying memory if not already set. */
> + if (io_worker_control->grow)
> + io_worker_control->grow = false;
Hm. pgaio_worker_grow(grow=false) is a bit odd. And this is basically a copy
of pgaio_worker_cancel_grow() - I realize that's intended for postmaster, but
somehow it's a bit odd.
Maybe just name it pgaio_worker_set_grow()?
> +/*
> + * Called by the postmaster to check if a new worker is needed.
> + */
> +bool
> +pgaio_worker_test_grow(void)
> +{
> + return io_worker_control && io_worker_control->grow;
> +}
> +
> +/*
> + * Called by the postmaster to clear the grow flag.
> + */
> +void
> +pgaio_worker_clear_grow(void)
> +{
> + if (io_worker_control)
> + io_worker_control->grow = false;
> +}
Maybe we should add _pm_ in there to make it clearer that they're not for
general use?
> @@ -226,8 +413,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> {
> PgAioHandle **synchronous_ios = NULL;
> int nsync = 0;
> - Latch *wakeup = NULL;
> - int worker;
> + int worker = -1;
>
> Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
>
> @@ -252,19 +438,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> break;
> }
>
> - if (wakeup == NULL)
> - {
> - /* Choose an idle worker to wake up if we haven't already. */
> - worker = pgaio_worker_choose_idle();
> - if (worker >= 0)
> - wakeup = io_worker_control->workers[worker].latch;
> -
> - pgaio_debug_io(DEBUG4, staged_ios[i],
> - "choosing worker %d",
> - worker);
> - }
> + /* Choose one worker to wake for this batch. */
> + if (worker == -1)
> + worker = pgaio_worker_choose_idle(0);
> }
If we only want to do this once per "batch", why not just do it outside the
num_staged_ios loop?
> @@ -295,14 +474,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> static void
> pgaio_worker_die(int code, Datum arg)
> {
> - LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> - Assert(io_worker_control->workers[MyIoWorkerId].in_use);
> - Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
> + PgAioWorkerSet notify_set;
>
> - io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
> - io_worker_control->workers[MyIoWorkerId].in_use = false;
> - io_worker_control->workers[MyIoWorkerId].latch = NULL;
> + LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> + pgaio_worker_set_remove(&io_worker_control->idle_worker_set, MyIoWorkerId);
> LWLockRelease(AioWorkerSubmissionQueueLock);
> +
> + LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
> + Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
> + io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
> + Assert(pgaio_worker_set_contains(&io_worker_control->worker_set, MyIoWorkerId));
> + pgaio_worker_set_remove(&io_worker_control->worker_set, MyIoWorkerId);
> + notify_set = io_worker_control->worker_set;
> + Assert(io_worker_control->nworkers > 0);
> + io_worker_control->nworkers--;
> + Assert(pgaio_worker_set_count(&io_worker_control->worker_set) ==
> + io_worker_control->nworkers);
> + LWLockRelease(AioWorkerControlLock);
> +
> + /* Notify other workers on pool change. */
Why are we notifying them on pool changes?
> + while (!pgaio_worker_set_is_empty(¬ify_set))
> + pgaio_worker_wake(pgaio_worker_set_pop_lowest(¬ify_set));
I did already wonder further up if pgaio_worker_wake() should just receive a
worker_set as an argument.
> @@ -312,33 +504,34 @@ pgaio_worker_die(int code, Datum arg)
> static void
> pgaio_worker_register(void)
> {
> + PgAioWorkerSet free_worker_set;
> + PgAioWorkerSet old_worker_set;
> +
> MyIoWorkerId = -1;
>
> - /*
> - * XXX: This could do with more fine-grained locking. But it's also not
> - * very common for the number of workers to change at the moment...
> - */
> - LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> + LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
I guess it could be useful to assert that nworkers is small enough before
doing anything.
> + pgaio_worker_set_fill(&free_worker_set);
> + pgaio_worker_set_subtract(&free_worker_set, &io_worker_control->worker_set);
> + if (!pgaio_worker_set_is_empty(&free_worker_set))
> + MyIoWorkerId = pgaio_worker_set_get_lowest(&free_worker_set);
> + if (MyIoWorkerId == -1)
> + elog(ERROR, "couldn't find a free worker ID");
I'd probably add a comment saying "/* find lowest unused worker ID */" or
such, that was more immediately obvious in the old code.
> +/*
> + * Check if this backend is allowed to time out, and thus should use a
> + * non-infinite sleep time. Only the highest-numbered worker is allowed to
> + * time out, and only if the pool is above io_min_workers. Serializing
> + * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
> + * io_min_workers.
But it's ok if a lower numbered worker errors out, right? There will be a
temporary gap, but we will start a new worker for it? Does that happen even
if there's a shrink of the set of required workers at the same time as a lower
numbered worker errors out?
> @@ -439,10 +666,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
> while (!ShutdownRequestPending)
> {
> uint32 io_index;
> - Latch *latches[IO_WORKER_WAKEUP_FANOUT];
> - int nlatches = 0;
> - int nwakeups = 0;
> - int worker;
> + int worker = -1;
> + int queue_depth = 0;
> + bool grow = false;
>
> /*
> * Try to get a job to do.
> @@ -453,38 +679,64 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
> LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
> {
> - /*
> - * Nothing to do. Mark self idle.
> - *
> - * XXX: Invent some kind of back pressure to reduce useless
> - * wakeups?
> - */
> - io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
> + /* Nothing to do. Mark self idle. */
> + pgaio_worker_set_insert(&io_worker_control->idle_worker_set,
> + MyIoWorkerId);
> }
> else
> {
> /* Got one. Clear idle flag. */
> - io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
> + pgaio_worker_set_remove(&io_worker_control->idle_worker_set,
> + MyIoWorkerId);
Wonder if we should keep track of whether we marked ourselves idle to avoid
needing to do that. But that would be a separate optimization really.
> + /*
> + * See if we should wake up a higher numbered peer. Only do that
> + * if this worker is not receiving spurious wakeups itself.
The "not receiving spurious wakeups" condition is wakeups < ios?
I think both "wakeups" and "ios" are a bit too generically named. Based on the
names I have no idea what this heuristic might be.
> + * This heuristic tries to discover the useful wakeup propagation
> + * chain length when IOs are very fast and workers wake up to find
> + * that all IOs have already been taken.
> + *
> + * If we chose not to wake a worker when we ideally should have,
> + * the ratio will soon be corrected.
> + */
> + if (wakeups <= ios)
> {
> + queue_depth = pgaio_worker_submission_queue_depth();
> + if (queue_depth > 0)
> + {
> + worker = pgaio_worker_choose_idle(MyIoWorkerId + 1);
Is it a problem that we are passing an ID that's potentially bigger than the
biggest legal worker ID? It's probably fine as long as MAX_IO_WORKERS is 32 and
the bitmap is a 64bit integer, but ...
> + /*
> + * If there were no idle higher numbered peers and there
> + * are more than enough IOs queued for me and all lower
> + * numbered peers, then try to start a new worker.
> + */
> + if (worker == -1 && queue_depth > MyIoWorkerId)
> + grow = true;
> + }
We probably shouldn't request growth when already at the cap? That could
generate a *lot* of pmsignal traffic, I think?
I don't have an immediate intuitive understanding of why the submission queue
depth is a good measure here.
If there are 10 workers that are busy 100% of the time, and the submission
queue is usually 6 deep with not-being-worked-on IOs, why do we not want to
start more workers?
It actually seems to work - but I don't actually understand why.
ninja install-test-files
io_max_workers=32
debug_io_direct=data
effective_io_concurrency=16
shared_buffers=5GB
pgbench -i -q -s 100 --fillfactor=30
CREATE EXTENSION IF NOT EXISTS test_aio;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
DROP TABLE IF EXISTS pattern_random_pgbench;
CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, pg_relation_size('pgbench_accounts')/8192)) AS pattern;
My test is:
SET effective_io_concurrency = 20;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;
We end up with ~24-28 workers, even though we never have more than 20 IOs in
flight. Not entirely sure why. I guess it's just that after doing an IO the
worker needs to mark itself idle etc?
> if (io_index != -1)
> {
> PgAioHandle *ioh = NULL;
>
> + /* Cancel timeout and update wakeup:work ratio. */
> + idle_timeout_abs = 0;
> + if (++ios == PGAIO_WORKER_STATS_MAX)
> + {
> + wakeups /= 2;
> + ios /= 2;
> + }
/* Saturation for counters used to estimate wakeup:work ratio. */
#define PGAIO_WORKER_STATS_MAX 4
STATS_MAX sounds like it's just about some reporting or such.
> ioh = &pgaio_ctl->io_handles[io_index];
> error_ioh = ioh;
> errcallback.arg = ioh;
> @@ -537,6 +789,14 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
> }
> #endif
>
> +#ifdef PGAIO_WORKER_SHOW_PS_INFO
> + sprintf(cmd, "%d: [%s] %s",
> + MyIoWorkerId,
> + pgaio_io_get_op_name(ioh),
> + pgaio_io_get_target_description(ioh));
> + set_ps_display(cmd);
> +#endif
Note that this leaks memory. See the target_description comment:
/*
* Return a stringified description of the IO's target.
*
* The string is localized and allocated in the current memory context.
*/
> /*
> * We don't expect this to ever fail with ERROR or FATAL, no need
> * to keep error_ioh set to the IO.
> @@ -550,8 +810,75 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
> }
> else
> {
> - WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> - WAIT_EVENT_IO_WORKER_MAIN);
> + int timeout_ms;
> +
> + /* Cancel new worker request if pending. */
> + pgaio_worker_grow(false);
That seems to happen very frequently.
> + /* Compute the remaining allowed idle time. */
> + if (io_worker_idle_timeout == -1)
> + {
> + /* Never time out. */
> + timeout_ms = -1;
> + }
> + else
> + {
> + TimestampTz now = GetCurrentTimestamp();
> +
> + /* If the GUC changes, reset timer. */
> + if (idle_timeout_abs != 0 &&
> + io_worker_idle_timeout != timeout_guc_used)
> + idle_timeout_abs = 0;
> +
> + /* On first sleep, compute absolute timeout. */
> + if (idle_timeout_abs == 0)
> + {
> + idle_timeout_abs =
> + TimestampTzPlusMilliseconds(now,
> + io_worker_idle_timeout);
> + timeout_guc_used = io_worker_idle_timeout;
> + }
> +
> + /*
> + * All workers maintain the absolute timeout value, but only
> + * the highest worker can actually time out and only if
> + * io_min_workers is satisfied. All others wait only for
> + * explicit wakeups caused by queue insertion, wakeup
> + * propagation, change of pool size (possibly promoting one to
> + * new highest) or GUC reload.
> + */
> + if (pgaio_worker_can_timeout())
> + timeout_ms =
> + TimestampDifferenceMilliseconds(now,
> + idle_timeout_abs);
> + else
> + timeout_ms = -1;
Hm. This way you get very rapid worker pool reductions. I configured
io_worker_idle_timeout=1s, started a bunch of work and observed the worker
count after the work finished:
Mon 06 Apr 2026 02:08:28 PM EDT (every 1s)

 count
-------
    32
(1 row)

Mon 06 Apr 2026 02:08:29 PM EDT (every 1s)

 count
-------
    32
(1 row)

Mon 06 Apr 2026 02:08:30 PM EDT (every 1s)

 count
-------
     1
(1 row)

Mon 06 Apr 2026 02:08:31 PM EDT (every 1s)

 count
-------
     1
(1 row)
Of course this is a ridiculously low setting, but it does seem like starting
the timeout even when not the highest numbered worker will lead to a lot of
quick yoyoing.
Greetings,
Andres Freund