Re: pgsql: Attempt to fix unstable regression tests, take 2 - Mailing list pgsql-committers

From David Rowley
Subject Re: pgsql: Attempt to fix unstable regression tests, take 2
Msg-id CAHoyFK9pHKPHyEp35QXo9NzkFOeupyRNONuEFgej4U54=Cmj2w@mail.gmail.com
In response to Re: pgsql: Attempt to fix unstable regression tests, take 2  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pgsql: Attempt to fix unstable regression tests, take 2  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-committers
On Tue, 31 Mar 2020 at 15:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've been trying to reproduce this by dint of running just the stats_ext
> script, over and over in a loop.  I've not had any success on fast
> machines, but on a slow one (florican's host) I got this after a few
> hundred iterations:

I've had a 13-year-old laptop running just stats_ext in a loop for
about an hour now. I managed to get 1000 runs without any failure.
Trying again with autovacuum_naptime set to 1s... another 1000 runs,
and nothing yet.

If you disable autovacuum on the problem table, can you still
reproduce the failure on that machine?
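
For a table that already exists, a minimal sketch of switching autovacuum
off would be the storage parameter below. The table name problem_table is
a placeholder, not a table from the stats_ext script:

```sql
-- problem_table is a hypothetical name; substitute the table under test.
-- Disables autovacuum (and autoanalyze) for this one table only.
ALTER TABLE problem_table SET (autovacuum_enabled = off);

-- To restore the default behaviour afterwards:
ALTER TABLE problem_table RESET (autovacuum_enabled);
```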

> Now this *IS* autovacuum interference, but it's hardly autovacuum's fault:
> the test script is supposing that autovac won't come in before it does a
> manual analyze, and that's just unsafe on its face.

Why would that matter?  The manual operation will just overwrite what
autovacuum did.  Obviously, there can't be any overlap due to the
ShareUpdateExclusiveLock.

My suspicion was that autovacuum ran a vacuum *after* the VACUUM
(ANALYZE). I've not studied the code, but I've had thoughts that the
manual operation might have slotted in just between when autovacuum
checked what work there was to do and when it actually did the work.
Unsure how likely that is given that we have table_recheck_autovac().
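
One way to check that ordering, sketched here under the assumption that
the table still exists when the failure is noticed, is to compare the
manual and automatic timestamps in pg_stat_user_tables. The table name
problem_table is again a placeholder:

```sql
-- problem_table is a hypothetical name; substitute the table under test.
-- If last_autovacuum or last_autoanalyze is later than last_vacuum or
-- last_analyze, autovacuum touched the table after the manual command.
SELECT relname,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'problem_table';
```

These counters are reported asynchronously, so they may lag slightly
behind the operations themselves.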

> I'm thinking that what we ought to do is have this test disable autovac
> altogether on its tables, ie
> CREATE TABLE ... WITH (autovacuum_enabled = off);
>
> However, I remain suspicious that there's something else going on,
> unrelated to autovac.  All the buildfarm cases so far have been
> small underestimates, one or two rows, so they look entirely different
> from the example above.  Even if autovacuum is firing unexpectedly,
> how would it cause such results?

Perhaps we can remain suspicious if we still see failures after fixing
it to disable autovacuum on these tables.  It seems to happen often
enough that if we don't see it again in a week, then we might be able
to assume that was the issue.

David


