Thread: pgsql: Attempt to fix unstable regression tests, take 2

pgsql: Attempt to fix unstable regression tests, take 2

From
David Rowley
Date:
Attempt to fix unstable regression tests, take 2

Following up on 2dc16efed, petalura has suffered some additional
failures in stats_ext which again appear to be around the timing of an
autovacuum during the test, causing instability in the row estimates.

Again, let's fix this by explicitly performing a VACUUM on the table
and not leave it to happen by chance of an autovacuum pass.

Discussion: https://postgr.es/m/CAApHDvok5hmXr%2BbUbJe7%2B2sQzWo4B_QzSk7RKFR9fP6BjYXx5g%40mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/24566b359d095c3800c2a326d88a595722813f58

Modified Files
--------------
src/test/regress/expected/stats_ext.out | 10 +++++-----
src/test/regress/sql/stats_ext.sql      | 10 +++++-----
2 files changed, 10 insertions(+), 10 deletions(-)


Re: pgsql: Attempt to fix unstable regression tests, take 2

From
Tom Lane
Date:
David Rowley <drowley@postgresql.org> writes:
> Attempt to fix unstable regression tests, take 2

That still isn't working, and I'm starting to think that we're
blaming the messenger.  To wit, what we are seeing here is that
these tests are unstable ... and that's a bug of the tests, not of
any adjustments of autovacuum timing.  I notice that at least
most of the deltas are in test cases that are only a couple weeks
old, so they don't have a lot of track record saying that they
are stable.

Specifically, I notice that in the last couple of failures involving
tests on "mcv_lists", we are considering stats on a 5000-row table.
With the standard setting default_statistics_target = 100, ANALYZE
will take a 3000-row random sample, meaning that its results are
*inherently* not 100% reproducible.  In the past we have avoided
that being a problem for the regression tests by dint of not
depending on exact estimation results, but this test is designed
to do precisely that.

I think what needs to happen here is to revert these changes
you've made, and instead either reduce the size of the test table
or increase default_statistics_target within this test, so that
ANALYZE's sample includes the whole table and is thereby completely
stable.  (I'd tend to favor the former approach if possible, just
to keep a lid on test runtime.)

Or we could decide that these tests are a bad idea and toss them
completely; but I bet Tomas will object.

            regards, tom lane



Re: pgsql: Attempt to fix unstable regression tests, take 2

From
Tom Lane
Date:
I wrote:
> Specifically, I notice that in the last couple of failures involving
> tests on "mcv_lists", we are considering stats on a 5000-row table.
> With the standard setting default_statistics_target = 100, ANALYZE
> will take a 3000-row random sample, meaning that its results are
> *inherently* not 100% reproducible.

No, wait, scratch that --- I misremembered the multiplier.  Actually
ANALYZE will try to acquire a 30000-row sample, so its sample *should*
always include the entire table.  Autovacuum or not, the stats ought
to come out exactly the same every time in this test case.  But they
are not doing so, at least on some machines.  We need to figure out
why that is.

Anyway, I remain suspicious that the instability is the fault
of something in the statistics code, not of autovacuum.

            regards, tom lane



Re: pgsql: Attempt to fix unstable regression tests, take 2

From
Tom Lane
Date:
I wrote:
> Anyway, I remain suspicious that the instability is the fault
> of something in the statistics code, not of autovacuum.

I've been trying to reproduce this by dint of running just the stats_ext
script, over and over in a loop.  I've not had any success on fast
machines, but on a slow one (florican's host) I got this after a few
hundred iterations:

diff -U3 /usr/home/tgl/pgsql/src/test/regress/expected/stats_ext.out
/usr/home/tgl/pgsql/src/test/regress/results/stats_ext.out
--- /usr/home/tgl/pgsql/src/test/regress/expected/stats_ext.out 2020-03-30 21:15:19.749874000 -0400
+++ /usr/home/tgl/pgsql/src/test/regress/results/stats_ext.out  2020-03-30 22:29:04.062103000 -0400
@@ -1172,7 +1172,7 @@
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
  estimated | actual
 -----------+--------
-         1 |     50
+        50 |     50
 (1 row)

 VACUUM (ANALYZE) mcv_lists;

Now this *IS* autovacuum interference, but it's hardly autovacuum's fault:
the test script is supposing that autovac won't come in before it does a
manual analyze, and that's just unsafe on its face.

I'm thinking that what we ought to do is have this test disable autovac
altogether on its tables, ie
CREATE TABLE ... WITH (autovacuum_enabled = off);

However, I remain suspicious that there's something else going on,
unrelated to autovac.  All the buildfarm cases so far have been
small underestimates, one or two rows, so they look entirely different
from the example above.  Even if autovacuum is firing unexpectedly,
how would it cause such results?

            regards, tom lane



Re: pgsql: Attempt to fix unstable regression tests, take 2

From
David Rowley
Date:
On Tue, 31 Mar 2020 at 15:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've been trying to reproduce this by dint of running just the stats_ext
> script, over and over in a loop.  I've not had any success on fast
> machines, but on a slow one (florican's host) I got this after a few
> hundred iterations:

I've had a 13 year old laptop running just stats_ext in a loop for
about an hour now. I managed to get 1000 runs without any failure.
Trying again with autovacuum_naptime set to 1s... 1000 runs, and
nothing yet.

If you disable autovacuum on the problem table, can you still
reproduce the failure on that machine?

> Now this *IS* autovacuum interference, but it's hardly autovacuum's fault:
> the test script is supposing that autovac won't come in before it does a
> manual analyze, and that's just unsafe on its face.

Why would that matter?  The manual operation will just overwrite what
autovacuum did.  Obviously, there can't be any overlap due to the
ShareUpdateExclusiveLock.

My suspicion was that autovacuum ran a vacuum *after* the VACUUM
(ANALYZE). I've not studied the code, but I've had thoughts that the
manual operation might have slotted in just between when autovacuum
checked what work there was to do and when it actually did the work.
Unsure how likely that is given that we have table_recheck_autovac().

> I'm thinking that what we ought to do is have this test disable autovac
> altogether on its tables, ie
> CREATE TABLE ... WITH (autovacuum_enabled = off);
>
> However, I remain suspicious that there's something else going on,
> unrelated to autovac.  All the buildfarm cases so far have been
> small underestimates, one or two rows, so they look entirely different
> from the example above.  Even if autovacuum is firing unexpectedly,
> how would it cause such results?

Perhaps we can remain suspicious if we still see failures after fixing
it to disable autovacuum on these tables.  It seems to happen often
enough that if we don't see it again in a week, then we might be able
to assume that was the issue.

David



Re: pgsql: Attempt to fix unstable regression tests, take 2

From
Tom Lane
Date:
David Rowley <dgrowley@gmail.com> writes:
> On Tue, 31 Mar 2020 at 15:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now this *IS* autovacuum interference, but it's hardly autovacuum's fault:
>> the test script is supposing that autovac won't come in before it does a
>> manual analyze, and that's just unsafe on its face.

> Why would that matter?

Look again at the failure: the problem is that the test script is
populating a table, then doing an EXPLAIN and expecting to see
results corresponding to a *not*-ANALYZED table, then doing ANALYZE,
then expecting to see EXPLAIN results corresponding to the analyzed
state.  It's the second step of that that is vulnerable to an
ill-timed auto analyze.  The only way to prevent it is to disable
autovac altogether on the table, as I did a little while ago
at 0936d1b6f.

This is of course not like the cases we've actually seen so far
in the buildfarm, but it's a case that I produced once and it
would surely recur.

It will be interesting to see if 0936d1b6f really fixes the issue
altogether or the instability continues --- but if it does, then
autovacuum is not the problem.

            regards, tom lane



Re: pgsql: Attempt to fix unstable regression tests, take 2

From
David Rowley
Date:
On Wed, 1 Apr 2020 at 13:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowley@gmail.com> writes:
> > On Tue, 31 Mar 2020 at 15:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Now this *IS* autovacuum interference, but it's hardly autovacuum's fault:
> >> the test script is supposing that autovac won't come in before it does a
> >> manual analyze, and that's just unsafe on its face.
>
> > Why would that matter?
>
> Look again at the failure: the problem is that the test script is
> populating a table, then doing an EXPLAIN and expecting to see
> results corresponding to a *not*-ANALYZED table, then doing ANALYZE,
> then expecting to see EXPLAIN results corresponding to the analyzed
> state.  It's the second step of that that is vulnerable to an
> ill-timed auto analyze.  The only way to prevent it is to disable
> autovac altogether on the table, as I did a little while ago
> at 0936d1b6f.

Can you share which failure you're talking about here?  All of the
ones I've looked at were failing post-ANALYZE.

David



Re: pgsql: Attempt to fix unstable regression tests, take 2

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> Can you share which failure you're talking about here?  All of the
> ones I've looked at were failing post-ANALYZE.

It's what I posted yesterday:

diff -U3 /usr/home/tgl/pgsql/src/test/regress/expected/stats_ext.out
/usr/home/tgl/pgsql/src/test/regress/results/stats_ext.out
--- /usr/home/tgl/pgsql/src/test/regress/expected/stats_ext.out 2020-03-30 21:15:19.749874000 -0400
+++ /usr/home/tgl/pgsql/src/test/regress/results/stats_ext.out  2020-03-30 22:29:04.062103000 -0400
@@ -1172,7 +1172,7 @@
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
  estimated | actual
 -----------+--------
-         1 |     50
+        50 |     50
 (1 row)

 VACUUM (ANALYZE) mcv_lists;

That was with stats_ext as of f01157e2ac8ac4dff8ba159c36edf2fdb7d6704e.

            regards, tom lane