Re: Intermittent buildfarm failures on wrasse - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Intermittent buildfarm failures on wrasse
Date
Msg-id 1364997.1649893086@sss.pgh.pa.us
Whole thread Raw
In response to Re: Intermittent buildfarm failures on wrasse  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Intermittent buildfarm failures on wrasse  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Apr 13, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
>> IIRC the problem in matter isn't skipped pages, but that the horizon simply isn't new enough to mark pages as all
visible.

> Sometimes OldestXmin can go backwards in VACUUM operations that are
> run in close succession against the same table, due to activity from
> other databases in the same cluster (perhaps other factors are
> involved at times).

I've been doing some testing locally by inserting commands to
manually set tenk1's relallvisible to zero.  I first did that
in test_setup.sql ... and it had no effect whatsoever.  Further
experimentation showed that the "CREATE INDEX ON tenk1" steps
in create_index.sql itself generally suffice to fix relallvisible;
although if you force it back to zero after the last such command,
you get the same plan diffs wrasse is showing.  And you don't
get any others, which I thought curious until I realized that
sanity_check.sql's database-wide VACUUM offers yet another
opportunity to heal the incorrect value.  If you force it back
to zero again after that, a bunch of later tests start to show
plan differences, which is what I'd been expecting.

So what seems to be happening on wrasse is that a background
autovacuum (or really autoanalyze?) is preventing pages from
being marked all-visible not only during test_setup.sql but
also create_index.sql; but it's gone by the time sanity_check.sql
runs.  Which is odd in itself because not that much time elapses
between create_index and sanity_check, certainly less than the
time from test_setup to create_index.

It seems like a reliable fix might require test_setup to wait
for any background autovac to exit before it does its own
vacuums.  Ick.

And we still lack an explanation of why this only now broke.
I remain suspicious that pgstats is behaving unexpectedly.

            regards, tom lane



pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: shared-memory based stats collector - v70
Next
From: Peter Geoghegan
Date:
Subject: Re: Intermittent buildfarm failures on wrasse