Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken
Date
Msg-id CA+hUKG+ogAon8_V223Ldv6taPR2uKH3X_UJ_A7LJAf3-VRARPA@mail.gmail.com
In response to Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Fri, Mar 10, 2023 at 1:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Mar 10, 2023 at 11:37 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
> > On Thu, Mar 09, 2023 at 05:27:08PM -0500, Tom Lane wrote:
> > > Is it reasonable to assume that all modern platforms can time
> > > millisecond delays accurately?  Ten years ago I'd have suggested
> > > truncating the delay to a multiple of 10msec and using this logic
> > > to track the remainder, but maybe now that's unnecessary.
> >
> > If so, it might also be worth updating or removing this comment in
> > pgsleep.c:
> >
> >      * NOTE: although the delay is specified in microseconds, the effective
> >      * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
> >      * the requested delay to be rounded up to the next resolution boundary.
> >
> > I've had doubts for some time about whether this is still accurate...

Unfortunately I was triggered by this Unix archeology discussion, and
wasted some time this weekend testing every Unix we target.  I found 3
groups:

1.  OpenBSD, NetBSD: Like the comment says, kernel ticks still control
sleep resolution.  I measure an average time of ~20ms when I ask for
1ms sleeps in a loop with select() or nanosleep().  I don't actually
understand why it's not ~10ms because HZ is 100 on these systems, but
I didn't look harder.

2.  AIX, Solaris, illumos: select() can sleep for 1ms accurately, but
not fractions of 1ms.  If you use nanosleep() instead of select(),
then AIX joins the third group (huh, maybe it's just that its
select(us) calls poll(ms) under the covers?), but Solaris does not
(maybe it's still tick-based, but HZ == 1000?).

3.  Linux, FreeBSD, macOS: sub-ms sleeps are quite accurate (via
various system calls).

I didn't test Windows but it sounds a lot like it is in group 1 if you
use WaitForMultipleObjects() or SleepEx(), as we do.
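
For anyone who wants to repeat this kind of measurement, here is a minimal
sketch of such a loop. It is not the program used for the numbers above; the
helper names now_ns() and average_sleep_us() are made up for illustration,
and it only covers the nanosleep() case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical helper: current monotonic time in nanoseconds. */
static long long
now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long) ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper: request 'iterations' sleeps of 'request_us'
 * microseconds with nanosleep() and return the average observed duration
 * of one sleep, in microseconds.
 */
static double
average_sleep_us(long request_us, int iterations)
{
    struct timespec req;
    long long   start;

    assert(request_us >= 0 && iterations > 0);
    req.tv_sec = request_us / 1000000;
    req.tv_nsec = (request_us % 1000000) * 1000;

    start = now_ns();
    for (int i = 0; i < iterations; i++)
        (void) nanosleep(&req, NULL);   /* interruptions ignored: rough test only */

    return (double) (now_ns() - start) / 1000.0 / iterations;
}
```

Calling average_sleep_us(1000, 1000) and printing the result should show
which group a system falls into: a value near 1000 suggests accurate
millisecond sleeps, while a value near 10000 or 20000 suggests tick-based
rounding.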

You can probably tune some of the above; for example FreeBSD can go
back to the old way with kern.eventtimer.periodic=1 to get a thousand
interrupts per second (kern.hz) instead of programming a hardware
timer to get an interrupt at just the right time, and then 0.5ms sleep
requests get rounded to an average of 1ms, just like on Solaris.  And
power usage probably goes up.

As for what to do about it, I dunno, how about this?

  * NOTE: although the delay is specified in microseconds, the effective
- * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
- * the requested delay to be rounded up to the next resolution boundary.
+ * resolution is only 1/HZ on systems that use periodic kernel ticks to limit
+ * sleeping.  This may cause sleeps to be rounded up by as much as 1-20
+ * milliseconds on old Unixen and Windows.

As for the following paragraph about the dangers of select() and
interrupts and restarts, I suspect it is describing the HP-UX
behaviour (a dropped platform), which I guess must have led to POSIX's
reluctance to standardise that properly, but in any case all
hypothetical concerns would disappear if we just used POSIX
[clock_]nanosleep(), no?  It has defined behaviour on signals, and it
also tells you the remaining time (if we cared, which we wouldn't).
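
To make the nanosleep() idea concrete, here is a rough sketch of a
microsecond sleep built on it. This is not the existing pg_usleep() code and
not a proposed patch; sleep_us() and its 'restart' flag are hypothetical
names used only to show the defined signal behaviour and the remaining-time
output.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/*
 * Hypothetical helper (not PostgreSQL's pg_usleep): sleep for 'microsec'
 * microseconds using nanosleep().  If a signal handler interrupts the
 * sleep, nanosleep() returns -1 with errno set to EINTR and stores the
 * unslept time in 'rem'.  If 'restart' is nonzero we resume with that
 * remainder; otherwise we return early, as a caller that does not care
 * about the remainder would.  Returns 0 on completion or early return,
 * and -1 on any other error.
 */
static int
sleep_us(long microsec, int restart)
{
    struct timespec req;
    struct timespec rem;

    assert(microsec >= 0);
    req.tv_sec = microsec / 1000000;
    req.tv_nsec = (microsec % 1000000) * 1000;

    while (nanosleep(&req, &rem) != 0)
    {
        if (errno != EINTR)
            return -1;          /* e.g. EINVAL for a bad timespec */
        if (!restart)
            return 0;           /* interrupted: let the caller react */
        req = rem;              /* resume with the time still owed */
    }
    return 0;
}
```

Plain nanosleep() is used here rather than clock_nanosleep() only to keep
the sketch short; the interrupt handling has the same shape either way.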


