Thread: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Melanie Plageman

Hi,

I think that 4753ef37e0ed undid the work caf626b2c did to support
sub-millisecond delays for vacuum and autovacuum.

After 4753ef37e0ed, vacuum_delay_point()'s local variable msec is a
double which, after being passed to WaitLatch() as timeout, which is a
long, ends up being 0, so we don't end up waiting AFAICT.

When I set [autovacuum_]vacuum_cost_delay to 0.5, SHOW will report that
it is 500us, but WaitLatch() is still getting 0 as timeout.

- Melanie



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I think that 4753ef37e0ed undid the work caf626b2c did to support
> sub-millisecond delays for vacuum and autovacuum.
>
> After 4753ef37e0ed, vacuum_delay_point()'s local variable msec is a
> double which, after being passed to WaitLatch() as timeout, which is a
> long, ends up being 0, so we don't end up waiting AFAICT.
>
> When I set [autovacuum_]vacuum_cost_delay to 0.5, SHOW will report that
> it is 500us, but WaitLatch() is still getting 0 as timeout.

Given that some of the clunkier underlying kernel primitives have
milliseconds in their interface, I don't think it would be possible to
make a usec-based variant of WaitEventSetWait() that works everywhere.
Could it possibly make sense to do something that accumulates the
error, so if you're using 0.5 then every second vacuum_delay_point()
waits for 1ms?
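
(A rough sketch of that accumulation idea as it might sit inside
vacuum_delay_point() -- variable names here are purely illustrative,
not taken from any actual patch:)

    /*
     * Carry the fractional part of the requested delay forward, so that
     * e.g. a 0.5ms setting yields one whole 1ms wait every second call.
     */
    static double residual_msec = 0;
    long        wait_msec;

    residual_msec += msec;
    wait_msec = (long) residual_msec;   /* whole milliseconds to wait now */
    residual_msec -= wait_msec;         /* keep the sub-msec remainder */

    if (wait_msec > 0)
    {
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         wait_msec,
                         WAIT_EVENT_VACUUM_DELAY);
        ResetLatch(MyLatch);
    }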



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> I think that 4753ef37e0ed undid the work caf626b2c did to support
>> sub-millisecond delays for vacuum and autovacuum.

> Given that some of the clunkier underlying kernel primitives have
> milliseconds in their interface, I don't think it would be possible to
> make a usec-based variant of WaitEventSetWait() that works everywhere.
> Could it possibly make sense to do something that accumulates the
> error, so if you're using 0.5 then every second vacuum_delay_point()
> waits for 1ms?

Yeah ... using float math there was cute, but it'd only get us so far.
The caf626b2c code would only work well on platforms that have
microsecond-based sleep primitives, so it was already not too portable.

Can we fix this by making VacuumCostBalance carry the extra fractional
delay, or would a separate variable be better?

            regards, tom lane



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Stephen Frost

Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:
> On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I think that 4753ef37e0ed undid the work caf626b2c did to support
> > sub-millisecond delays for vacuum and autovacuum.
> >
> > After 4753ef37e0ed, vacuum_delay_point()'s local variable msec is a
> > double which, after being passed to WaitLatch() as timeout, which is a
> > long, ends up being 0, so we don't end up waiting AFAICT.
> >
> > When I set [autovacuum_]vacuum_cost_delay to 0.5, SHOW will report that
> > it is 500us, but WaitLatch() is still getting 0 as timeout.
>
> Given that some of the clunkier underlying kernel primitives have
> milliseconds in their interface, I don't think it would be possible to
> make a usec-based variant of WaitEventSetWait() that works everywhere.
> Could it possibly make sense to do something that accumulates the
> error, so if you're using 0.5 then every second vacuum_delay_point()
> waits for 1ms?

Hmm.  That generally makes sense to me... though it isn't exactly the same.
Still, I wouldn't want to go back to purely pg_usleep() as that has the
other downsides mentioned.

Perhaps if the delay is sub-millisecond, explicitly do the WaitLatch()
with zero but also do the pg_usleep()?  That's doing a fair bit of work
beyond just sleeping, but it also means we shouldn't miss out on the
postmaster going away or similar...
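
(Very roughly, as a sketch of that hybrid -- using the same latch flags
the existing vacuum_delay_point() code uses, with the sub-millisecond
threshold and conversion being assumptions:)

    if (msec < 1)
    {
        /*
         * Poll the latch with a zero timeout so we still notice latch sets
         * and postmaster death, then do the short sleep with pg_usleep().
         */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         0,
                         WAIT_EVENT_VACUUM_DELAY);
        ResetLatch(MyLatch);
        pg_usleep((long) (msec * 1000));
    }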

Thanks,

Stephen


Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> >> I think that 4753ef37e0ed undid the work caf626b2c did to support
> >> sub-millisecond delays for vacuum and autovacuum.
>
> > Given that some of the clunkier underlying kernel primitives have
> > milliseconds in their interface, I don't think it would be possible to
> > make a usec-based variant of WaitEventSetWait() that works everywhere.
> > Could it possibly make sense to do something that accumulates the
> > error, so if you're using 0.5 then every second vacuum_delay_point()
> > waits for 1ms?
>
> Yeah ... using float math there was cute, but it'd only get us so far.
> The caf626b2c code would only work well on platforms that have
> microsecond-based sleep primitives, so it was already not too portable.

Also, the previous coding was already b0rked, because pg_usleep()
rounds up to milliseconds on Windows (with a surprising formula for
rounding), and also the whole concept seems to assume things about
schedulers that aren't really universally true.  If we actually cared
about high res times maybe we should be using nanosleep and tracking
the drift?  And spreading it out a bit.  But I don't know.

> Can we fix this by making VacuumCostBalance carry the extra fractional
> delay, or would a separate variable be better?

I was wondering the same thing, but not being too familiar with that
code, no opinion on that yet.



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Melanie Plageman

On Thu, Mar 9, 2023 at 5:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
> > > <melanieplageman@gmail.com> wrote:
> > >> I think that 4753ef37e0ed undid the work caf626b2c did to support
> > >> sub-millisecond delays for vacuum and autovacuum.
> >
> > > Given that some of the clunkier underlying kernel primitives have
> > > milliseconds in their interface, I don't think it would be possible to
> > > make a usec-based variant of WaitEventSetWait() that works everywhere.
> > > Could it possibly make sense to do something that accumulates the
> > > error, so if you're using 0.5 then every second vacuum_delay_point()
> > > waits for 1ms?
> >
> > Yeah ... using float math there was cute, but it'd only get us so far.
> > The caf626b2c code would only work well on platforms that have
> > microsecond-based sleep primitives, so it was already not too portable.
>
> Also, the previous coding was already b0rked, because pg_usleep()
> rounds up to milliseconds on Windows (with a surprising formula for
> rounding), and also the whole concept seems to assume things about
> schedulers that aren't really universally true.  If we actually cared
> about high res times maybe we should be using nanosleep and tracking
> the drift?  And spreading it out a bit.  But I don't know.
>
> > Can we fix this by making VacuumCostBalance carry the extra fractional
> > delay, or would a separate variable be better?
>
> I was wondering the same thing, but not being too familiar with that
> code, no opinion on that yet.

Well, VacuumCostBalance is reset to zero at the top of vacuum() -- which is
called for each table by autovacuum -- so any fractional delay carried there
would get reset to zero between autovacuuming tables. I dunno how you feel
about that...

- Melanie



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The caf626b2c code would only work well on platforms that have
>> microsecond-based sleep primitives, so it was already not too portable.

> Also, the previous coding was already b0rked, because pg_usleep()
> rounds up to milliseconds on Windows (with a surprising formula for
> rounding), and also the whole concept seems to assume things about
> schedulers that aren't really universally true.  If we actually cared
> about high res times maybe we should be using nanosleep and tracking
> the drift?  And spreading it out a bit.  But I don't know.

Yeah, I was wondering about trying to make it a closed-loop control,
but I think that'd be huge overkill considering what the mechanism is
trying to accomplish.

A minimalistic fix could be as attached.  I'm not sure if it's worth
making the state variable global so that it can be reset to zero in
the places where we zero out VacuumCostBalance etc.  Also note that
this is ignoring the VacuumSharedCostBalance stuff, so you'd possibly
have the extra delay accumulating in unexpected places when there are
multiple workers.  But I really doubt it's worth worrying about that.

Is it reasonable to assume that all modern platforms can time
millisecond delays accurately?  Ten years ago I'd have suggested
truncating the delay to a multiple of 10msec and using this logic
to track the remainder, but maybe now that's unnecessary.

            regards, tom lane

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2e12baf8eb..64d3c59709 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2229,14 +2229,30 @@ vacuum_delay_point(void)
     /* Nap if appropriate */
     if (msec > 0)
     {
+        /*
+         * Since WaitLatch can only wait in units of milliseconds, carry any
+         * residual fractional msec in a static variable, and accumulate it to
+         * add into the wait interval next time.
+         */
+        static double residual_msec = 0;
+        long        lmsec;
+
+        msec += residual_msec;
+
         if (msec > VacuumCostDelay * 4)
             msec = VacuumCostDelay * 4;
 
-        (void) WaitLatch(MyLatch,
-                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                         msec,
-                         WAIT_EVENT_VACUUM_DELAY);
-        ResetLatch(MyLatch);
+        lmsec = floor(msec);
+        residual_msec = msec - lmsec;
+
+        if (lmsec > 0)
+        {
+            (void) WaitLatch(MyLatch,
+                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+                             lmsec,
+                             WAIT_EVENT_VACUUM_DELAY);
+            ResetLatch(MyLatch);
+        }
 
         VacuumCostBalance = 0;


Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Nathan Bossart

On Thu, Mar 09, 2023 at 05:27:08PM -0500, Tom Lane wrote:
> Is it reasonable to assume that all modern platforms can time
> millisecond delays accurately?  Ten years ago I'd have suggested
> truncating the delay to a multiple of 10msec and using this logic
> to track the remainder, but maybe now that's unnecessary.

If so, it might also be worth updating or removing this comment in
pgsleep.c:

     * NOTE: although the delay is specified in microseconds, the effective
     * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
     * the requested delay to be rounded up to the next resolution boundary.

I've had doubts for some time about whether this is still accurate...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Melanie Plageman

On Thu, Mar 9, 2023 at 5:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> The caf626b2c code would only work well on platforms that have
> >> microsecond-based sleep primitives, so it was already not too portable.
>
> > Also, the previous coding was already b0rked, because pg_usleep()
> > rounds up to milliseconds on Windows (with a surprising formula for
> > rounding), and also the whole concept seems to assume things about
> > schedulers that aren't really universally true.  If we actually cared
> > about high res times maybe we should be using nanosleep and tracking
> > the drift?  And spreading it out a bit.  But I don't know.
>
> Yeah, I was wondering about trying to make it a closed-loop control,
> but I think that'd be huge overkill considering what the mechanism is
> trying to accomplish.

Not relevant to fixing this, but I wonder if you could eliminate the need
to specify the cost delay in most cases for autovacuum by using feedback
from how much vacuuming work was done during the last cycle of vacuuming
to control the delay value internally -- a kind of feedback-adjusted
controller.

- Melanie



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Melanie Plageman

On Thu, Mar 9, 2023 at 5:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> The caf626b2c code would only work well on platforms that have
> >> microsecond-based sleep primitives, so it was already not too portable.
>
> > Also, the previous coding was already b0rked, because pg_usleep()
> > rounds up to milliseconds on Windows (with a surprising formula for
> > rounding), and also the whole concept seems to assume things about
> > schedulers that aren't really universally true.  If we actually cared
> > about high res times maybe we should be using nanosleep and tracking
> > the drift?  And spreading it out a bit.  But I don't know.
>
> Yeah, I was wondering about trying to make it a closed-loop control,
> but I think that'd be huge overkill considering what the mechanism is
> trying to accomplish.
>
> A minimalistic fix could be as attached.  I'm not sure if it's worth
> making the state variable global so that it can be reset to zero in
> the places where we zero out VacuumCostBalance etc.  Also note that
> this is ignoring the VacuumSharedCostBalance stuff, so you'd possibly
> have the extra delay accumulating in unexpected places when there are
> multiple workers.  But I really doubt it's worth worrying about that.

What if someone resets the delay GUC and there is still a large residual?



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Tom Lane

Melanie Plageman <melanieplageman@gmail.com> writes:
> On Thu, Mar 9, 2023 at 5:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A minimalistic fix could be as attached.  I'm not sure if it's worth
>> making the state variable global so that it can be reset to zero in
>> the places where we zero out VacuumCostBalance etc.  Also note that
>> this is ignoring the VacuumSharedCostBalance stuff, so you'd possibly
>> have the extra delay accumulating in unexpected places when there are
>> multiple workers.  But I really doubt it's worth worrying about that.

> What if someone resets the delay guc and there is still a large residual?

By definition, the residual is less than 1msec.

            regards, tom lane



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Fri, Mar 10, 2023 at 11:37 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Thu, Mar 09, 2023 at 05:27:08PM -0500, Tom Lane wrote:
> > Is it reasonable to assume that all modern platforms can time
> > millisecond delays accurately?  Ten years ago I'd have suggested
> > truncating the delay to a multiple of 10msec and using this logic
> > to track the remainder, but maybe now that's unnecessary.
>
> If so, it might also be worth updating or removing this comment in
> pgsleep.c:
>
>      * NOTE: although the delay is specified in microseconds, the effective
>      * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
>      * the requested delay to be rounded up to the next resolution boundary.
>
> I've had doubts for some time about whether this is still accurate...

What I see with the old select(), or a more modern clock_nanosleep()
call, is that Linux, FreeBSD, macOS are happy sleeping for .1ms, .5ms,
1ms, 2ms, 3ms, and through inaccuracies and scheduling overheads etc
it works out to about 5-25% extra sleep time (I expect that can be
affected by choice of time source/available hardware, and perhaps
various system calls use different tricks).  I definitely recall the
behaviour described, back in the old days where more stuff was
scheduler-tick based.  I have no clue for Windows; quick googling
tells me that it might still be pretty chunky, unless you do certain
other stuff that I didn't follow up; we could probably get more
accurate sleep times by rummaging through nt.dll.  It would be good to
find out how well WaitEventSet does on Windows; perhaps we should have
a little timing accuracy test in the tree to collect build farm data?

FWIW epoll has a newer epoll_pwait2() call that has a higher-res timeout
argument, and Windows WaitEventSet could also do high res timers if
you add timer events rather than using the timeout argument, and I
guess conceptually even the old poll() thing could do the equivalent
with a signal alarm timer, but it sounds a lot like a bad idea to do
very short sleeps to me, burning so much CPU on scheduling.  I kinda
wonder if the 10ms + residual thing might even turn out to be a better
idea... but I dunno.

The 1ms residual thing looks pretty good to me as a fix to the
immediate problem report, but we might also want to adjust the wording
in config.sgml?



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

Erm, but maybe I'm just looking at this too myopically.  Is there
really any point in letting people set it to 0.5, if it behaves as if
you'd set it to 1 and doubled the cost limit?  Isn't it just more
confusing?  I haven't read the discussion from when fractional delays
came in, where I imagine that must have come up...



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> Erm, but maybe I'm just looking at this too myopically.  Is there
> really any point in letting people set it to 0.5, if it behaves as if
> you'd set it to 1 and doubled the cost limit?  Isn't it just more
> confusing?  I haven't read the discussion from when fractional delays
> came in, where I imagine that must have come up...

At [1] I argued

>> The reason is this: what we want to do is throttle VACUUM's I/O demand,
>> and by "throttle" I mean "gradually reduce".  There is nothing gradual
>> about issuing a few million I/Os and then sleeping for many milliseconds;
>> that'll just produce spikes and valleys in the I/O demand.  Ideally,
>> what we'd have it do is sleep for a very short interval after each I/O.
>> But that's not too practical, both for code-structure reasons and because
>> most platforms don't give us a way to so finely control the length of a
>> sleep.  Hence the design of sleeping for awhile after every so many I/Os.
>> 
>> However, the current settings are predicated on the assumption that
>> you can't get the kernel to give you a sleep of less than circa 10ms.
>> That assumption is way outdated, I believe; poking around on systems
>> I have here, the minimum delay time using pg_usleep(1) seems to be
>> generally less than 100us, and frequently less than 10us, on anything
>> released in the last decade.
>> 
>> I propose therefore that instead of increasing vacuum_cost_limit,
>> what we ought to be doing is reducing vacuum_cost_delay by a similar
>> factor.  And, to provide some daylight for people to reduce it even
>> more, we ought to arrange for it to be specifiable in microseconds
>> not milliseconds.  There's no GUC_UNIT_US right now, but it's time.

That last point was later overruled in favor of keeping it measured in
msec to avoid breaking existing configuration files.  Nonetheless,
vacuum_cost_delay *is* an actual time to wait (conceptually at least),
not just part of a unitless ratio; and there seem to be good arguments
in favor of letting people make it small.

I take your point that really short sleeps are inefficient so far as the
scheduling overhead goes.  But on modern machines you probably have to get
down to a not-very-large number of microseconds before that's a big deal.

            regards, tom lane

[1] https://www.postgresql.org/message-id/28720.1552101086%40sss.pgh.pa.us



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Fri, Mar 10, 2023 at 1:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I propose therefore that instead of increasing vacuum_cost_limit,
> >> what we ought to be doing is reducing vacuum_cost_delay by a similar
> >> factor.  And, to provide some daylight for people to reduce it even
> >> more, we ought to arrange for it to be specifiable in microseconds
> >> not milliseconds.  There's no GUC_UNIT_US right now, but it's time.
>
> That last point was later overruled in favor of keeping it measured in
> msec to avoid breaking existing configuration files.  Nonetheless,
> vacuum_cost_delay *is* an actual time to wait (conceptually at least),
> not just part of a unitless ratio; and there seem to be good arguments
> in favor of letting people make it small.
>
> I take your point that really short sleeps are inefficient so far as the
> scheduling overhead goes.  But on modern machines you probably have to get
> down to a not-very-large number of microseconds before that's a big deal.

OK.  One idea is to provide a WaitLatchUsec(), which is just some
cross platform donkeywork that I think I know how to type in, and it
would have to round up on poll() and Windows builds.  Then we could
either also provide WaitEventSetResolution() that returns 1000 or 1
depending on availability of 1us waits so that we could round
appropriately and then track residual, but beyond that let the user
worry about inaccuracies and overheads (as mentioned in the
documentation), or we could start consulting the clock and tracking
our actual sleep time and true residual over time (maybe that's what
"closed-loop control" means?).



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> OK.  One idea is to provide a WaitLatchUsec(), which is just some
> cross platform donkeywork that I think I know how to type in, and it
> would have to round up on poll() and Windows builds.  Then we could
> either also provide WaitEventSetResolution() that returns 1000 or 1
> depending on availability of 1us waits so that we could round
> appropriately and then track residual, but beyond that let the user
> worry about inaccuracies and overheads (as mentioned in the
> documentation),

... so we'd still need to have the residual-sleep-time logic?

> or we could start consulting the clock and tracking
> our actual sleep time and true residual over time (maybe that's what
> "closed-loop control" means?).

Yeah, I was hand-waving about trying to measure our actual sleep times.
On reflection I doubt it's a great idea.  It'll add overhead and there's
still a question of whether measurement noise would accumulate.

            regards, tom lane



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Fri, Mar 10, 2023 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > OK.  One idea is to provide a WaitLatchUsec(), which is just some
> > cross platform donkeywork that I think I know how to type in, and it
> > would have to round up on poll() and Windows builds.  Then we could
> > either also provide WaitEventSetResolution() that returns 1000 or 1
> > depending on availability of 1us waits so that we could round
> > appropriately and then track residual, but beyond that let the user
> > worry about inaccuracies and overheads (as mentioned in the
> > documentation),
>
> ... so we'd still need to have the residual-sleep-time logic?

Ah, perhaps not.  Considering that the historical behaviour on the
main affected platform (Windows) was already to round up to
milliseconds before we latchified this code anyway, and now a google
search is telling me that the relevant timer might in fact be *super*
lumpy, perhaps even to the point of 1/64th of a second[1] (maybe
that's a problem for a Windows hacker to look into some time; I really
should create a wiki page of known Windows problems in search of a
hacker)...  it now looks like sub-ms residual logic would be a bit
pointless after all.

I'll go and see about usec latch waits.  More soon.

[1] https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-great-rule-change/



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Fri, Mar 10, 2023 at 2:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'll go and see about usec latch waits.  More soon.

Here are some experimental patches along those lines.  Seems good
locally, but I saw a random failure I don't understand on CI so
apparently I need to find a bug; at least this gives an idea of how
this might look.  Unfortunately, the new interface on Linux turned out
to be newer than I first realised: Linux 5.11+ (so RHEL 9, Debian
12/Bookworm, Ubuntu 21.04/Hirsute Hippo), so unless we're OK with it
taking a couple more years to be more widely used, we'll need some
fallback code.  Perhaps something like 0004, which also shows the sort
of thing that we might consider back-patching to 14 and 15 (next
revision I'll move that up the front and put it in back-patchable
form).  It's not exactly beautiful; maybe sharing code with recovery's
lazy PM-exit detection could help.  Of course, the new μs-based wait
API could be used wherever we do timestamp-based waiting, for no
particular reason other than that it is the resolution of our
timestamps, so there is no need to bother rounding; I doubt anyone
would notice or care much about that, but it's a vote in favour of μs
rather than the other obvious contender ns, which modern underlying
kernel primitives are using.
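
(The Linux interface being referred to is epoll_pwait2(), which takes a
struct timespec timeout; a minimal sketch of a sub-millisecond wait with
it, assuming an already-created epoll fd "epfd" and ignoring error
handling:)

    #define _GNU_SOURCE
    #include <sys/epoll.h>
    #include <time.h>

    /* Wait up to 500us for readiness events (requires Linux 5.11+). */
    struct epoll_event events[1];
    struct timespec timeout = { .tv_sec = 0, .tv_nsec = 500 * 1000 };
    int     nevents = epoll_pwait2(epfd, events, 1, &timeout, NULL);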


Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Fri, Mar 10, 2023 at 6:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... Perhaps something like 0004, which also shows the sort
> of thing that we might consider back-patching to 14 and 15 (next
> revision I'll move that up the front and put it in back-patchable
> form).

I think this is the minimal back-patchable change.  I propose to go
ahead and do that, and then to kick the ideas about latch API changes
into a new thread for the next commitfest.


Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> I think this is the minimal back-patchable change.  I propose to go
> ahead and do that, and then to kick the ideas about latch API changes
> into a new thread for the next commitfest.

OK by me, but then again 4753ef37 wasn't my patch.

            regards, tom lane



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Fri, Mar 10, 2023 at 1:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Mar 10, 2023 at 11:37 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
> > On Thu, Mar 09, 2023 at 05:27:08PM -0500, Tom Lane wrote:
> > > Is it reasonable to assume that all modern platforms can time
> > > millisecond delays accurately?  Ten years ago I'd have suggested
> > > truncating the delay to a multiple of 10msec and using this logic
> > > to track the remainder, but maybe now that's unnecessary.
> >
> > If so, it might also be worth updating or removing this comment in
> > pgsleep.c:
> >
> >      * NOTE: although the delay is specified in microseconds, the effective
> >      * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
> >      * the requested delay to be rounded up to the next resolution boundary.
> >
> > I've had doubts for some time about whether this is still accurate...

Unfortunately I was triggered by this Unix archeology discussion, and
wasted some time this weekend testing every Unix we target.  I found 3
groups:

1.  OpenBSD, NetBSD: Like the comment says, kernel ticks still control
sleep resolution.  I measure an average time of ~20ms when I ask for
1ms sleeps in a loop with select() or nanosleep().  I don't actually
understand why it's not ~10ms because HZ is 100 on these systems, but
I didn't look harder.

2.  AIX, Solaris, illumos: select() can sleep for 1ms accurately, but
not fractions of 1ms.  If you use nanosleep() instead of select(),
then AIX joins the third group (huh, maybe it's just that its
select(us) calls poll(ms) under the covers?), but Solaris does not
(maybe it's still tick-based, but HZ == 1000?).

3.  Linux, FreeBSD, macOS: sub-ms sleeps are quite accurate (via
various system calls).

I didn't test Windows but it sounds a lot like it is in group 1 if you
use WaitForMultipleObjects() or SleepEx(), as we do.

You can probably tune some of the above; for example FreeBSD can go
back to the old way with kern.eventtimer.periodic=1 to get a thousand
interrupts per second (kern.hz) instead of programming a hardware
timer to get an interrupt at just the right time, and then 0.5ms sleep
requests get rounded to an average of 1ms, just like on Solaris.  And
power usage probably goes up.
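
(For reference, a sketch of the kind of measurement loop behind numbers
like these, assuming POSIX clock_gettime() and nanosleep():)

    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
        struct timespec req = {0, 1000000};     /* request 1ms sleeps */
        struct timespec start, end;
        int         iterations = 1000;
        double      total_ms;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < iterations; i++)
            nanosleep(&req, NULL);
        clock_gettime(CLOCK_MONOTONIC, &end);

        total_ms = (end.tv_sec - start.tv_sec) * 1000.0 +
                   (end.tv_nsec - start.tv_nsec) / 1000000.0;
        printf("average sleep: %.3f ms\n", total_ms / iterations);
        return 0;
    }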

As for what to do about it, I dunno, how about this?

  * NOTE: although the delay is specified in microseconds, the effective
- * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
- * the requested delay to be rounded up to the next resolution boundary.
+ * resolution is only 1/HZ on systems that use periodic kernel ticks to limit
+ * sleeping.  This may cause sleeps to be rounded up by as much as 1-20
+ * milliseconds on old Unixen and Windows.

As for the following paragraph about the dangers of select() and
interrupts and restarts, I suspect it is describing the HP-UX
behaviour (a dropped platform), which I guess must have led to POSIX's
reluctance to standardise that properly, but in any case all
hypothetical concerns would disappear if we just used POSIX
[clock_]nanosleep(), no?  It has defined behaviour on signals, and it
also tells you the remaining time (if we cared, which we wouldn't).
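
(A sketch of what a nanosleep()-based pg_usleep() might look like,
keeping the existing microsecond interface -- illustrative only, not
the committed form:)

    #include <time.h>

    void
    pg_usleep(long microsec)
    {
        if (microsec > 0)
        {
            struct timespec delay;

            delay.tv_sec = microsec / 1000000L;
            delay.tv_nsec = (microsec % 1000000L) * 1000;
            (void) nanosleep(&delay, NULL);
        }
    }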



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Sat, Mar 11, 2023 at 11:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > I think this is the minimal back-patchable change.  I propose to go
> > ahead and do that, and then to kick the ideas about latch API changes
> > into a new thread for the next commitfest.
>
> OK by me, but then again 4753ef37 wasn't my patch.

I'll wait another day to see if Stephen or anyone else who hasn't hit
Monday yet wants to object.

Here also are those other minor tweaks, for master only.  I see now
that nanosleep() has already been proposed before:

https://www.postgresql.org/message-id/flat/CABQrizfxpBLZT5mZeE0js5oCh1tqEWvcGF3vMRCv5P-RwUY5dQ%40mail.gmail.com
https://www.postgresql.org/message-id/flat/4902.1552349020%40sss.pgh.pa.us

There I see the question of whether it should loop on EINTR to keep
waiting the remaining time.  Generally it seems like a job for
something higher level to deal with interruption policy, and of course
all the race condition and portability problems inherent with signals
are fixed by using latches instead, so I don't think there really is a
good answer to that question -- if you loop, you break our programming
rules by wilfully ignoring eg global barriers, but if you don't loop,
it implies you're relying on the interrupt to cause you to do
something and yet you might have missed it if it was delivered just
before the syscall.  At the time of the earlier thread, maybe it was
more acceptable as it could only delay cancel for that backend, but
now it might even delay arbitrary other backends, and neither answer
to that question can fix that in a race-free way.  Also, back then
latches had a SIGUSR1 handler on common systems, but now they don't,
so (racy unreliable) latch responsiveness has decreased since then.
So I think we should just leave the interface as it is, and build
better things and then eventually retire it.  This general topic is
also currently being discussed at:

https://www.postgresql.org/message-id/flat/20230209205929.GA720594%40nathanxps13

I propose to go ahead and make this small improvement anyway because
it'll surely be a while before we delete the last pg_usleep() call,
and it's good to spring-clean old confusing commentary about signals
and portability.


Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Nathan Bossart

>   * NOTE: although the delay is specified in microseconds, the effective
> - * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
> - * the requested delay to be rounded up to the next resolution boundary.
> + * resolution is only 1/HZ on systems that use periodic kernel ticks to wake
> + * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on older
> + * Unixen and Windows.

nitpick: Could the 1/HZ versus 20 milliseconds discrepancy cause confusion?
Otherwise, I think this is the right idea.

> + * CAUTION: if interrupted by a signal, this function will return, but its
> + * interface doesn't report that.  It's not a good idea to use this
> + * for long sleeps in the backend, because backends are expected to respond to
> + * interrupts promptly.  Better practice for long sleeps is to use WaitLatch()
> + * with a timeout.

I'm not sure this argument follows.  If pg_usleep() returns if interrupted,
then why are we concerned about delayed responses to interrupts?

> -        delay.tv_usec = microsec % 1000000L;
> -        (void) select(0, NULL, NULL, NULL, &delay);
> +        delay.tv_nsec = (microsec % 1000000L) * 1000;
> +        (void) nanosleep(&delay, NULL);

Using nanosleep() seems reasonable to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Tue, Mar 14, 2023 at 12:10 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> >   * NOTE: although the delay is specified in microseconds, the effective
> > - * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
> > - * the requested delay to be rounded up to the next resolution boundary.
> > + * resolution is only 1/HZ on systems that use periodic kernel ticks to wake
> > + * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on older
> > + * Unixen and Windows.
>
> nitpick: Could the 1/HZ versus 20 milliseconds discrepancy cause confusion?
> Otherwise, I think this is the right idea.

Better words welcome; 1-20ms summarises the range I actually measured,
and if reports are correct about Windows' HZ=64 (1/HZ = 15.625ms) then
it neatly covers that too, so I don't feel too bad about not chasing
down the reason for that 10ms/20ms discrepancy; maybe I looked at the
wrong HZ number (which you can change, anyway), I'm not too used to
NetBSD...  BTW they have a project plan to fix that
https://wiki.netbsd.org/projects/project/tickless/

> > + * CAUTION: if interrupted by a signal, this function will return, but its
> > + * interface doesn't report that.  It's not a good idea to use this
> > + * for long sleeps in the backend, because backends are expected to respond to
> > + * interrupts promptly.  Better practice for long sleeps is to use WaitLatch()
> > + * with a timeout.
>
> I'm not sure this argument follows.  If pg_usleep() returns if interrupted,
> then why are we concerned about delayed responses to interrupts?

Because you can't rely on it:

1.  Maybe the signal is delivered just before pg_usleep() begins, and
a handler sets some flag we would like to react to.  Now pg_usleep()
will not be interrupted.  That problem is solved by using latches
instead.
2.  Maybe the signal is one that is no longer handled by a handler at
all; these days, latches use SIGURG, which pops out when you read a
signalfd or kqueue, so pg_usleep() will not wake up.  That problem is
solved by using latches instead.

(The word "interrupt" is a bit overloaded, which doesn't help with
this discussion.)

> > -             delay.tv_usec = microsec % 1000000L;
> > -             (void) select(0, NULL, NULL, NULL, &delay);
> > +             delay.tv_nsec = (microsec % 1000000L) * 1000;
> > +             (void) nanosleep(&delay, NULL);
>
> Using nanosleep() seems reasonable to me.

Thanks for looking!



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Nathan Bossart

On Tue, Mar 14, 2023 at 03:38:45PM +1300, Thomas Munro wrote:
> On Tue, Mar 14, 2023 at 12:10 PM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> >   * NOTE: although the delay is specified in microseconds, the effective
>> > - * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
>> > - * the requested delay to be rounded up to the next resolution boundary.
>> > + * resolution is only 1/HZ on systems that use periodic kernel ticks to wake
>> > + * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on older
>> > + * Unixen and Windows.
>>
>> nitpick: Could the 1/HZ versus 20 milliseconds discrepancy cause confusion?
>> Otherwise, I think this is the right idea.
> 
> Better words welcome; 1-20ms summarises the range I actually measured,
> and if reports are correct about Windows' HZ=64 (1/HZ = 15.625ms) then
> it neatly covers that too, so I don't feel too bad about not chasing
> down the reason for that 10ms/20ms discrepancy; maybe I looked at the
> wrong HZ number (which you can change, anyway), I'm not too used to
> NetBSD...  BTW they have a project plan to fix that
> https://wiki.netbsd.org/projects/project/tickless/

Here is roughly what I had in mind:

    NOTE: Although the delay is specified in microseconds, older Unixen and
    Windows use periodic kernel ticks to wake up, which might increase the
    delay time significantly.  We've observed delay increases as large as
    20 milliseconds on supported platforms.

>> > + * CAUTION: if interrupted by a signal, this function will return, but its
>> > + * interface doesn't report that.  It's not a good idea to use this
>> > + * for long sleeps in the backend, because backends are expected to respond to
>> > + * interrupts promptly.  Better practice for long sleeps is to use WaitLatch()
>> > + * with a timeout.
>>
>> I'm not sure this argument follows.  If pg_usleep() returns if interrupted,
>> then why are we concerned about delayed responses to interrupts?
> 
> Because you can't rely on it:
> 
> 1.  Maybe the signal is delivered just before pg_usleep() begins, and
> a handler sets some flag we would like to react to.  Now pg_usleep()
> will not be interrupted.  That problem is solved by using latches
> instead.
> 2.  Maybe the signal is one that is no longer handled by a handler at
> all; these days, latches use SIGURG, which pops out when you read a
> signalfd or kqueue, so pg_usleep() will not wake up.  That problem is
> solved by using latches instead.
> 
> (The word "interrupt" is a bit overloaded, which doesn't help with
> this discussion.)

Yeah, I think it would be clearer if "interrupt" was disambiguated.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

From: Thomas Munro

On Wed, Mar 15, 2023 at 7:54 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here is roughly what I had in mind:
>
>         NOTE: Although the delay is specified in microseconds, older Unixen and
>         Windows use periodic kernel ticks to wake up, which might increase the
>         delay time significantly.  We've observed delay increases as large as
>         20 milliseconds on supported platforms.

Sold.  And pushed.

I couldn't let that 20ms != 1s/100 problem go, despite my claim that I
would, and now I see:  NetBSD does have 10ms resolution, so everyone
can relax, arithmetic still works.  It's just that it always or often
adds on one extra tick, for some strange reason.  So you can measure
20ms, 30ms, ... but never as low as 10ms.  *Shrug*.  Your description
covered that nicely.

https://marc.info/?l=netbsd-current-users&m=144832117108168&w=2

> > (The word "interrupt" is a bit overloaded, which doesn't help with
> > this discussion.)
>
> Yeah, I think it would be clearer if "interrupt" was disambiguated.

OK, I rewrote it to avoid that terminology.

One small detail, after reading Tom's 2019 proposal to do this[1]: he
mentioned SUSv2's ENOSYS error.  I see that SUSv3 (POSIX.1-2001)
dropped that.  Systems that don't have the "timers" option simply
shouldn't define the function, but we already require the "timers"
option for clock_gettime().  And more practically, I know that all our
target systems have it and it works.

Pushed.

[1] https://www.postgresql.org/message-id/4902.1552349020@sss.pgh.pa.us