Thread: SKIP_LOCKED test causes random buildfarm failures

SKIP_LOCKED test causes random buildfarm failures

From
Tom Lane
Date:
Every once in awhile we get failures like this one:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2019-11-05%2008%3A27%3A27

diff -U3 /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/../pgsql/src/test/regress/expected/vacuum.out
/home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/src/test/regress/results/vacuum.out
--- /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/../pgsql/src/test/regress/expected/vacuum.out    2019-08-11
03:02:18.921535948-0700 
+++ /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/src/test/regress/results/vacuum.out    2019-11-05
00:50:42.381244885-0800 
@@ -204,6 +204,7 @@
 -- SKIP_LOCKED option
 VACUUM (SKIP_LOCKED) vactst;
 VACUUM (SKIP_LOCKED, FULL) vactst;
+WARNING:  skipping vacuum of "vactst" --- lock not available
 ANALYZE (SKIP_LOCKED) vactst;
 -- ensure VACUUM and ANALYZE don't have a problem with serializable
 SET default_transaction_isolation = serializable;


No doubt this is a conflict with autovacuum.  There are two reasonable
ways to remove the test instability:

* Crank up client_min_messages to more than WARNING for this test
stanza.

* Downgrade the "skipping" messages to DEBUG1 or less.

I kind of wonder why we are issuing a "WARNING" when the statement
does exactly what you asked it to, anyway.  At most I'd expect
that to be a NOTICE condition.

            regards, tom lane



Re: SKIP_LOCKED test causes random buildfarm failures

From
Andres Freund
Date:
Hi,

On 2019-11-06 16:54:38 -0500, Tom Lane wrote:
> Every once in awhile we get failures like this one:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2019-11-05%2008%3A27%3A27
> 
> diff -U3 /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/../pgsql/src/test/regress/expected/vacuum.out
/home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/src/test/regress/results/vacuum.out
> --- /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/../pgsql/src/test/regress/expected/vacuum.out
2019-08-1103:02:18.921535948 -0700
 
> +++ /home/pgsql/build-farm/buildroot-clang/HEAD/pgsql.build/src/test/regress/results/vacuum.out    2019-11-05
00:50:42.381244885-0800
 
> @@ -204,6 +204,7 @@
>  -- SKIP_LOCKED option
>  VACUUM (SKIP_LOCKED) vactst;
>  VACUUM (SKIP_LOCKED, FULL) vactst;
> +WARNING:  skipping vacuum of "vactst" --- lock not available
>  ANALYZE (SKIP_LOCKED) vactst;
>  -- ensure VACUUM and ANALYZE don't have a problem with serializable
>  SET default_transaction_isolation = serializable;
> 
> 
> No doubt this is a conflict with autovacuum.  There are two reasonable
> ways to remove the test instability:

I assume you consider disabling autovacuum for that table not a
reasonable approach? Due to the danger that it could end up still
running, e.g. due to anti-wraparound or such?


> * Crank up client_min_messages to more than WARNING for this test
> stanza.

Hm.


> * Downgrade the "skipping" messages to DEBUG1 or less.
> 
> I kind of wonder why we are issuing a "WARNING" when the statement
> does exactly what you asked it to, anyway.  At most I'd expect
> that to be a NOTICE condition.

I don't know what lead us to doing so, but it doesn't seem reasonable to
allow the user to see whether the table has actually been vacuumed. I
would assume that one uses SKIP_LOCKED partially to avoid unnecessary
impacts in production due to other tasks starting to block on e.g. a
VACUUM FULL, even though without the "ordered queueing" everything could
just go on working fine. I'm not sure that indicates whether WARNING or
NOTICE is the best choice.

So I'd be inclined to go with the client_min_messages approach?


Greetings,

Andres Freund



Re: SKIP_LOCKED test causes random buildfarm failures

From
Michael Paquier
Date:
On Wed, Nov 06, 2019 at 03:01:11PM -0800, Andres Freund wrote:
> I don't know what lead us to doing so, but it doesn't seem reasonable to
> allow the user to see whether the table has actually been vacuumed. I
> would assume that one uses SKIP_LOCKED partially to avoid unnecessary
> impacts in production due to other tasks starting to block on e.g. a
> VACUUM FULL, even though without the "ordered queueing" everything could
> just go on working fine. I'm not sure that indicates whether WARNING or
> NOTICE is the best choice.

Good question.  That's a historical choice, still I have seen cases
where those warnings are helpful while not making the logs too
verbose to see some congestion in the jobs.

> So I'd be inclined to go with the client_min_messages approach?

The main purpose of the tests in regress/ is to check after the
grammar, so using client_min_messages sounds like a plan.  We have
a second set of tests in isolation/ where I would actually like to
disable autovacuum by default on a subset of tables.  Thoughts about
the attached?
--
Michael

Attachment

Re: SKIP_LOCKED test causes random buildfarm failures

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Wed, Nov 06, 2019 at 03:01:11PM -0800, Andres Freund wrote:
>> I don't know what lead us to doing so, but it doesn't seem reasonable to
>> allow the user to see whether the table has actually been vacuumed. I
>> would assume that one uses SKIP_LOCKED partially to avoid unnecessary
>> impacts in production due to other tasks starting to block on e.g. a
>> VACUUM FULL, even though without the "ordered queueing" everything could
>> just go on working fine. I'm not sure that indicates whether WARNING or
>> NOTICE is the best choice.

> Good question.  That's a historical choice, still I have seen cases
> where those warnings are helpful while not making the logs too
> verbose to see some congestion in the jobs.

I kind of feel that NOTICE is more semantically appropriate, but
perhaps there's an argument for keeping it at WARNING.

>> So I'd be inclined to go with the client_min_messages approach?

> The main purpose of the tests in regress/ is to check after the
> grammar, so using client_min_messages sounds like a plan.  We have
> a second set of tests in isolation/ where I would actually like to
> disable autovacuum by default on a subset of tables.  Thoughts about
> the attached?

I do not want to fix this in the main tests by disabling autovacuum,
because that'd actually reduce the tests' cross-section.  The fact
that this happens occasionally is a Good Thing for verifying that the
code paths actually work.  So it seems that there's a consensus on
adjusting client_min_messages to fix the test output instability ---
but we need to agree on whether to do s/WARNING/NOTICE/ first, so we
can know what to set client_min_messages to.

As for the case in the isolation test, shouldn't we also use
client_min_messages there, rather than prevent the conflict
from arising?  Or would that test fail in some larger way if
autovacuum gets in the way?

            regards, tom lane



Re: SKIP_LOCKED test causes random buildfarm failures

From
Michael Paquier
Date:
On Thu, Nov 07, 2019 at 11:50:25AM -0500, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Good question.  That's a historical choice, still I have seen cases
>> where those warnings are helpful while not making the logs too
>> verbose to see some congestion in the jobs.
>
> I kind of feel that NOTICE is more semantically appropriate, but
> perhaps there's an argument for keeping it at WARNING.

Perhaps.  Well, that's the same level as the one used after the
permission checks on the relation vacuumed.

> I do not want to fix this in the main tests by disabling autovacuum,
> because that'd actually reduce the tests' cross-section.  The fact
> that this happens occasionally is a Good Thing for verifying that the
> code paths actually work.  So it seems that there's a consensus on
> adjusting client_min_messages to fix the test output instability ---
> but we need to agree on whether to do s/WARNING/NOTICE/ first, so we
> can know what to set client_min_messages to.

Makes sense.

> As for the case in the isolation test, shouldn't we also use
> client_min_messages there, rather than prevent the conflict
> from arising?  Or would that test fail in some larger way if
> autovacuum gets in the way?

I think that there is no risk regarding the stability of the output
because we use LOCK before from a first session on the relation to
vacuum in a second session.  So if autovacuum runs in parallel, the
consequence would be a small slow down while waiting on the lock to be
taken.  And per the way the test is ordered, it seems to me that it
makes the most sense to disable autovacuum as it would just get in the
way.  In this case I think that it is actually better to show the
messages as that makes the tests more verbose and we make sure to test
their format, even if we could just rely on the fact that VACUUM
should just be blocking or non-blocking.
--
Michael

Attachment

Re: SKIP_LOCKED test causes random buildfarm failures

From
Tom Lane
Date:
I wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Good question.  That's a historical choice, still I have seen cases
>> where those warnings are helpful while not making the logs too
>> verbose to see some congestion in the jobs.

> I kind of feel that NOTICE is more semantically appropriate, but
> perhaps there's an argument for keeping it at WARNING.

Well, I'm not hearing any groundswell of support for changing the
message level, so let's leave that as-is and just see about
stabilizing the tests.

>> The main purpose of the tests in regress/ is to check after the
>> grammar, so using client_min_messages sounds like a plan.  We have
>> a second set of tests in isolation/ where I would actually like to
>> disable autovacuum by default on a subset of tables.  Thoughts about
>> the attached?

After looking more closely at the isolation test, I agree that adding
the "ALTER TABLE ... SET (autovacuum_enabled = false)" bits to it is
a good idea.  The LOCK operations should make that irrelevant for
part1, but there's at least a theoretical hazard for part2.
(Actually, is "autovacuum_enabled = false" really sufficient to
keep autovacuum away?  It'd probably lock the table for long enough
to examine its reloptions, so it seems like all we're doing here is
reducing the window for trouble a little bit.  Still, maybe that's
worthwhile.)

As for the SKIP_LOCKED tests in vacuum.sql, what I now propose is that
we just remove them.  I do not see that they're offering any coverage
that's not completely redundant with this isolation test.  We don't
need to spend cycles every day on that.

            regards, tom lane



Re: SKIP_LOCKED test causes random buildfarm failures

From
Andres Freund
Date:
Hi,

On 2019-11-14 13:37:44 -0500, Tom Lane wrote:
> I wrote:
> > Michael Paquier <michael@paquier.xyz> writes:
> >> Good question.  That's a historical choice, still I have seen cases
> >> where those warnings are helpful while not making the logs too
> >> verbose to see some congestion in the jobs.
> 
> > I kind of feel that NOTICE is more semantically appropriate, but
> > perhaps there's an argument for keeping it at WARNING.
> 
> Well, I'm not hearing any groundswell of support for changing the
> message level, so let's leave that as-is and just see about
> stabilizing the tests.

Ok.


> >> The main purpose of the tests in regress/ is to check after the
> >> grammar, so using client_min_messages sounds like a plan.  We have
> >> a second set of tests in isolation/ where I would actually like to
> >> disable autovacuum by default on a subset of tables.  Thoughts about
> >> the attached?
> 
> After looking more closely at the isolation test, I agree that adding
> the "ALTER TABLE ... SET (autovacuum_enabled = false)" bits to it is
> a good idea.  The LOCK operations should make that irrelevant for
> part1, but there's at least a theoretical hazard for part2.
> (Actually, is "autovacuum_enabled = false" really sufficient to
> keep autovacuum away?  It'd probably lock the table for long enough
> to examine its reloptions, so it seems like all we're doing here is
> reducing the window for trouble a little bit.  Still, maybe that's
> worthwhile.)

+1


> As for the SKIP_LOCKED tests in vacuum.sql, what I now propose is that
> we just remove them.  I do not see that they're offering any coverage
> that's not completely redundant with this isolation test.  We don't
> need to spend cycles every day on that.

-0 on that, I'd rather just put a autovacuum_enabled = false for
them. They're quick enough, and it's nice to have decent coverage of
various options within the plain regression tests when possible.

Greetings,

Andres Freund



Re: SKIP_LOCKED test causes random buildfarm failures

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-11-14 13:37:44 -0500, Tom Lane wrote:
>> As for the SKIP_LOCKED tests in vacuum.sql, what I now propose is that
>> we just remove them.  I do not see that they're offering any coverage
>> that's not completely redundant with this isolation test.  We don't
>> need to spend cycles every day on that.

> -0 on that, I'd rather just put a autovacuum_enabled = false for
> them. They're quick enough, and it's nice to have decent coverage of
> various options within the plain regression tests when possible.

If we're going to keep them in vacuum.sql, we should use the
client_min_messages fix there, as that's a full solution not just
reducing the window.  But I don't agree that these tests are worth
the cycles, given the coverage elsewhere.  The probability of breaking
this option is just not high enough to justify core-regression-test
coverage.

            regards, tom lane



Re: SKIP_LOCKED test causes random buildfarm failures

From
Michael Paquier
Date:
On Thu, Nov 14, 2019 at 03:20:09PM -0500, Tom Lane wrote:
> If we're going to keep them in vacuum.sql, we should use the
> client_min_messages fix there, as that's a full solution not just
> reducing the window.  But I don't agree that these tests are worth
> the cycles, given the coverage elsewhere.  The probability of breaking
> this option is just not high enough to justify core-regression-test
> coverage.

I would rather keep the solution with client_min_messages, and the
tests in vacuum.sql to keep those checks for the grammar parsing.  So
this basically brings us back to use the patch I proposed here:
https://www.postgresql.org/message-id/20191107013942.GA1768@paquier.xyz

Any objections?
--
Michael

Attachment

Re: SKIP_LOCKED test causes random buildfarm failures

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> I would rather keep the solution with client_min_messages, and the
> tests in vacuum.sql to keep those checks for the grammar parsing.  So
> this basically brings us back to use the patch I proposed here:
> https://www.postgresql.org/message-id/20191107013942.GA1768@paquier.xyz

OK.

            regards, tom lane



Re: SKIP_LOCKED test causes random buildfarm failures

From
Michael Paquier
Date:
On Fri, Nov 15, 2019 at 11:22:20AM -0500, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> I would rather keep the solution with client_min_messages, and the
>> tests in vacuum.sql to keep those checks for the grammar parsing.  So
>> this basically brings us back to use the patch I proposed here:
>> https://www.postgresql.org/message-id/20191107013942.GA1768@paquier.xyz
>
> OK.

Thanks, applied and back-patched.
--
Michael

Attachment