It's hard to know how to set io_workers=3. If it's too small,
io_method=worker's small submission queue overflows and it silently
falls back to synchronous IO. If it's too high, it generates a lot of
pointless wakeups and scheduling overhead, which might be considered
an independent problem or not, but having the right size pool
certainly mitigates it. Here's a patch to replace that GUC with:
io_min_workers=1
io_max_workers=8
io_worker_idle_timeout=60s
io_worker_launch_interval=500ms
It grows the pool when a backlog is detected (better ideas for this
logic welcome), and lets idle workers time out. IO jobs were already
concentrated into the lowest numbered workers, partly because that
seemed to have marginally better latency than anything else tried so
far due to latch collapsing with lucky timing, and partly in
anticipation of this.
The patch also reduces bogus wakeups a bit by being a bit more
cautious about fanout. That could probably be improved a lot more and
needs more research. It's quite tricky to figure out how to suppress
wakeups without throwing potential concurrency away.
The first couple of patches are independent of this topic, and might
be potential cleanups/fixes for master/v18. The last is a simple
latency test.
Ideas, testing, flames etc welcome.