Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Apr 13, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
>> IIRC the problem in matter isn't skipped pages, but that the horizon simply isn't new enough to mark pages as all
visible.
> Sometimes OldestXmin can go backwards in VACUUM operations that are
> run in close succession against the same table, due to activity from
> other databases in the same cluster (perhaps other factors are
> involved at times).
I've been doing some testing locally by inserting commands to
manually set tenk1's relallvisible to zero. I first did that
in test_setup.sql ... and it had no effect whatsoever. Further
experimentation showed that the "CREATE INDEX ON tenk1" steps
in create_index.sql itself generally suffice to fix relallvisible;
although if you force it back to zero after the last such command,
you get the same plan diffs wrasse is showing. And you don't
get any others, which I thought curious until I realized that
sanity_check.sql's database-wide VACUUM offers yet another
opportunity to heal the incorrect value. If you force it back
to zero again after that, a bunch of later tests start to show
plan differences, which is what I'd been expecting.
So what seems to be happening on wrasse is that a background
autovacuum (or really autoanalyze?) is preventing pages from
being marked all-visible not only during test_setup.sql but
also create_index.sql; but it's gone by the time sanity_check.sql
runs. Which is odd in itself because not that much time elapses
between create_index and sanity_check, certainly less than the
time from test_setup to create_index.
It seems like a reliable fix might require test_setup to wait
for any background autovac to exit before it does its own
vacuums. Ick.
And we still lack an explanation of why this only now broke.
I remain suspicious that pgstats is behaving unexpectedly.
regards, tom lane