Thread: Timeout control within tests

Timeout control within tests

From
Noah Misch
Date:
On Fri, Feb 05, 2021 at 03:55:20PM -0500, Tom Lane wrote:
> We have, almost invariably, regretted it when we tried to use short
> timeouts in test cases.

> More generally, sometimes people want to do things like run a test
> under valgrind.  So it's not just "underpowered machines" that may
> need a generous timeout.  Even if we did reduce the default, I'd
> want a way (probably via an environment variable, cf PGCTLTIMEOUT)
> to kick it back up.

I have a few use cases for that:

1. My buildfarm members have more and more competition for CPU and I/O.
   Examples where I suspect animal slowness caused a 180s timeout:
   https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-04-11%2003%3A11%3A39
   https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-26%2004%3A38%3A29

2. When I'm developing a change and I locally break a test in a way that leads
   to a timeout, I like to be able to lower that timeout.

3. I want more tests to use the right timeout from the start.  Low-timeout
   tests appear at least a few times per year:
   d03eeab Mon May 31 00:29:58 2021 -0700 Raise a timeout to 180s, in test 010_logical_decoding_timelines.pl.
   388b959 Sat Feb 27 07:02:56 2021 -0800 Raise a timeout to 180s, in contrib/test_decoding.
   08dde1b Tue Dec 22 11:10:12 2020 -0500 Increase timeout in 021_row_visibility.pl.
   8961355 Sat Apr 25 18:45:27 2020 -0700 Raise a timeout to 180s, in test 003_recovery_targets.pl.
   1db439a Mon Dec 10 20:15:42 2018 -0800 Raise some timeouts to 180s, in test code.

I propose to have environment variable PG_TEST_TIMEOUT_DEFAULT control the
timeout used in the places that currently hard-code 180s.  TAP tests should
retrieve the value via $PostgreSQL::Test::Utils::timeout_default.  pg_regress
tests should retrieve it via \getenv.  I would like to back-patch the TAP
part, for cause (1).  (The pg_regress part hasn't been a buildfarm problem,
and \getenv is new in v15.)  Patches attached.  I considered and excluded
other changes, for now:

a. I considered consolidating this with PGISOLATIONTIMEOUT (default 300).  One
   could remove the older variable entirely or make isolationtester use the
   first-available of [PGISOLATIONTIMEOUT, 2 * PG_TEST_TIMEOUT_DEFAULT, 360].
   Does anyone have an opinion on what, if anything, to do there?

b. I briefly made stats.sql accept PG_TEST_TIMEOUT_DEFAULT to override its
   hard-coded 30s timeout.  However, a higher timeout won't help when a UDP
   buffer fills.  If the test were structured to observe evidence of a vacant
   UDP buffer before proceeding with the test stat messages, a higher timeout
   could make more sense.  I added a comment.

c. One could remove timeout-duration function arguments (e.g. from
   pg_recvlogical_upto) and just have the function consult timeout_default.
   This felt like highly-optional refactoring.

Attachment

Re: Timeout control within tests

From
Andres Freund
Date:
Hi,

On 2022-02-17 21:28:42 -0800, Noah Misch wrote:
> I propose to have environment variable PG_TEST_TIMEOUT_DEFAULT control the
> timeout used in the places that currently hard-code 180s.

Meson's test runner has the concept of a "timeout multiplier" for ways of
running tests. Meson's stuff is about entire tests (i.e. one tap test), so
doesn't apply here, but I wonder if we shouldn't do something similar?  That
way we could adjust different timeouts with one setting, instead of many
different fobs to adjust?

Greetings,

Andres Freund



Re: Timeout control within tests

From
Noah Misch
Date:
On Thu, Feb 17, 2022 at 09:48:25PM -0800, Andres Freund wrote:
> On 2022-02-17 21:28:42 -0800, Noah Misch wrote:
> > I propose to have environment variable PG_TEST_TIMEOUT_DEFAULT control the
> > timeout used in the places that currently hard-code 180s.
> 
> Meson's test runner has the concept of a "timeout multiplier" for ways of
> running tests. Meson's stuff is about entire tests (i.e. one tap test), so
> doesn't apply here, but I wonder if we shouldn't do something similar?

Hmmm.  It is good if the user can express an intent that continues to make
sense if we change the default timeout.  For the buildfarm use case, a
multiplier is moderately better on that axis (PG_TEST_TIMEOUT_MULTIPLIER=100
beats PG_TEST_TIMEOUT_DEFAULT=18000).  For the hacker use case, an absolute
value is substantially better on that axis (PG_TEST_TIMEOUT_DEFAULT=3 beats
PG_TEST_TIMEOUT_MULTIPLIER=.016666).

> That
> way we could adjust different timeouts with one setting, instead of many
> different fobs to adjust?

I expect multiplier vs. absolute value doesn't change the expected number of
settings.  If this change proceeds, we'd have three: PG_TEST_TIMEOUT_DEFAULT,
PGCTLTIMEOUT, and PGISOLATIONTIMEOUT.  PGCTLTIMEOUT is separate for conceptual
reasons, and PGISOLATIONTIMEOUT is separate for historical reasons.  There's
little use case for setting them to unequal values.  If Meson can pass down
the overall timeout in effect for the test file, we could compute all three
variables from the passed-down value.  Orthogonal to Meson, as I mentioned, we
could eliminate PGISOLATIONTIMEOUT.

timeouts.spec used to have substantial timeouts that had to elapse for the
test to pass.  (Commit 741d7f1 ended that era.)  A multiplier would have been
a good fit for that use case.  If a similar test came back, we'd likely want
two multipliers, a low one for elapsing timeouts and a high one for
non-elapsing timeouts.  A multiplier of 10-100 is reasonable for non-elapsing
timeouts, with the exact value being irrelevant on the buildfarm.  Setting an
elapsing timeout higher than necessary causes measurable waste.

One could argue for offering both a multiplier variable and an absolute-value
variable.  If there's just one variable, I think the absolute-value variable
is more compelling, due to the aforementioned hacker use case.  What do you
think?



Re: Timeout control within tests

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Thu, Feb 17, 2022 at 09:48:25PM -0800, Andres Freund wrote:
>> Meson's test runner has the concept of a "timeout multiplier" for ways of
>> running tests. Meson's stuff is about entire tests (i.e. one tap test), so
>> doesn't apply here, but I wonder if we shouldn't do something similar?

> Hmmm.  It is good if the user can express an intent that continues to make
> sense if we change the default timeout.  For the buildfarm use case, a
> multiplier is moderately better on that axis (PG_TEST_TIMEOUT_MULTIPLIER=100
> beats PG_TEST_TIMEOUT_DEFAULT=18000).  For the hacker use case, an absolute
> value is substantially better on that axis (PG_TEST_TIMEOUT_DEFAULT=3 beats
> PG_TEST_TIMEOUT_MULTIPLIER=.016666).

FWIW, I'm fairly sure that PGISOLATIONTIMEOUT=300 was selected after
finding that smaller values didn't work reliably in the buildfarm.
Now maybe 741d7f1 fixed that, but I wouldn't count on it.  So while I
approve of the idea to remove PGISOLATIONTIMEOUT in favor of using this
centralized setting, I think that we might need to have a multiplier
there, or else we'll end up with PG_TEST_TIMEOUT_DEFAULT set to 300
across the board.  Perhaps the latter is fine, but a multiplier seems a
bit more flexible.

On the other hand, I also support your point that an absolute setting
is easier to think about / adjust for special uses.  So maybe we should
just KISS and use a single absolute setting until we find a hard reason
why that doesn't work well.

            regards, tom lane



Re: Timeout control within tests

From
Noah Misch
Date:
On Fri, Feb 18, 2022 at 10:26:52AM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Thu, Feb 17, 2022 at 09:48:25PM -0800, Andres Freund wrote:
> >> Meson's test runner has the concept of a "timeout multiplier" for ways of
> >> running tests. Meson's stuff is about entire tests (i.e. one tap test), so
> >> doesn't apply here, but I wonder if we shouldn't do something similar?
> 
> > Hmmm.  It is good if the user can express an intent that continues to make
> > sense if we change the default timeout.  For the buildfarm use case, a
> > multiplier is moderately better on that axis (PG_TEST_TIMEOUT_MULTIPLIER=100
> > beats PG_TEST_TIMEOUT_DEFAULT=18000).  For the hacker use case, an absolute
> > value is substantially better on that axis (PG_TEST_TIMEOUT_DEFAULT=3 beats
> > PG_TEST_TIMEOUT_MULTIPLIER=.016666).
> 
> FWIW, I'm fairly sure that PGISOLATIONTIMEOUT=300 was selected after
> finding that smaller values didn't work reliably in the buildfarm.
> Now maybe 741d7f1 fixed that, but I wouldn't count on it.  So while I
> approve of the idea to remove PGISOLATIONTIMEOUT in favor of using this
> centralized setting, I think that we might need to have a multiplier
> there, or else we'll end up with PG_TEST_TIMEOUT_DEFAULT set to 300
> across the board.  Perhaps the latter is fine, but a multiplier seems a
> bit more flexible.

The PGISOLATIONTIMEOUT replacement was 2*timeout_default, so isolation suites
would get 2*180s=360s.  (I don't want to lower any default timeouts, but I
don't mind raising them.)  In a sense, PG_TEST_TIMEOUT_DEFAULT is a multiplier
with as many sites as possible multiplying it by 1.  The patch has multiples
at two code sites.

> On the other hand, I also support your point that an absolute setting
> is easier to think about / adjust for special uses.  So maybe we should
> just KISS and use a single absolute setting until we find a hard reason
> why that doesn't work well.



Re: Timeout control within tests

From
Noah Misch
Date:
(I pushed the main patch as f2698ea, on 2022-03-04.)

On Fri, Feb 18, 2022 at 06:41:36PM -0800, Noah Misch wrote:
> On Fri, Feb 18, 2022 at 10:26:52AM -0500, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Thu, Feb 17, 2022 at 09:48:25PM -0800, Andres Freund wrote:
> > >> Meson's test runner has the concept of a "timeout multiplier" for ways of
> > >> running tests. Meson's stuff is about entire tests (i.e. one tap test), so
> > >> doesn't apply here, but I wonder if we shouldn't do something similar?
> > 
> > > Hmmm.  It is good if the user can express an intent that continues to make
> > > sense if we change the default timeout.  For the buildfarm use case, a
> > > multiplier is moderately better on that axis (PG_TEST_TIMEOUT_MULTIPLIER=100
> > > beats PG_TEST_TIMEOUT_DEFAULT=18000).  For the hacker use case, an absolute
> > > value is substantially better on that axis (PG_TEST_TIMEOUT_DEFAULT=3 beats
> > > PG_TEST_TIMEOUT_MULTIPLIER=.016666).
> > 
> > FWIW, I'm fairly sure that PGISOLATIONTIMEOUT=300 was selected after
> > finding that smaller values didn't work reliably in the buildfarm.
> > Now maybe 741d7f1 fixed that, but I wouldn't count on it.  So while I
> > approve of the idea to remove PGISOLATIONTIMEOUT in favor of using this
> > centralized setting, I think that we might need to have a multiplier
> > there, or else we'll end up with PG_TEST_TIMEOUT_DEFAULT set to 300
> > across the board.  Perhaps the latter is fine, but a multiplier seems a
> > bit more flexible.
> 
> The PGISOLATIONTIMEOUT replacement was 2*timeout_default, so isolation suites
> would get 2*180s=360s.  (I don't want to lower any default timeouts, but I
> don't mind raising them.)  In a sense, PG_TEST_TIMEOUT_DEFAULT is a multiplier
> with as many sites as possible multiplying it by 1.  The patch has multiples
> at two code sites.

Here's the PGISOLATIONTIMEOUT replacement patch.  I waffled on whether to
back-patch.  Since it affects only isolation suite testing, only on systems
too slow for the default timeout, it's not a major decision.  I currently plan
not to back-patch, since slow systems that would have wanted a back-patch can
just set both variables.

Attachment