Intermittent buildfarm failures on wrasse - Mailing list pgsql-hackers

From: Tom Lane
Subject: Intermittent buildfarm failures on wrasse
Date:
Msg-id: 1346227.1649887693@sss.pgh.pa.us
Responses: Re: Intermittent buildfarm failures on wrasse  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
For the past five days or so, wrasse has been intermittently
failing due to unexpectedly not using an Index Only Scan plan
in the create_index test [1], eg

@@ -1910,11 +1910,15 @@
 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)
 ORDER BY unique1;
-                      QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_unique1 on tenk1
-   Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
-(2 rows)
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+   Sort Key: unique1
+   ->  Bitmap Heap Scan on tenk1
+         Recheck Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+         ->  Bitmap Index Scan on tenk1_unique1
+               Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(6 rows)

 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)

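(The diff above is against the expected plan shape; the test presumably
runs the query under EXPLAIN with costs suppressed, along the lines of

EXPLAIN (COSTS OFF)
SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
ORDER BY unique1;

so a cost shift large enough to change the chosen plan shows up
immediately as a regression diff.)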
The most probable explanation for this seems to be that tenk1's
pg_class.relallvisible value hasn't been set high enough to make an IOS
look cheaper than the alternatives.  Where that ought to be getting set
is the "VACUUM ANALYZE tenk1" step in test_setup.sql.  It's plausible
I guess that a background autovacuum is preventing that command from
setting relallvisible as high as it ought to be --- but if so, why
are we only seeing two plans changing, on only one animal?
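To put numbers on that theory, one could inspect what the setup VACUUM
actually left behind; something along these lines (run by hand against
the regression database, not part of any test) would show whether the
visibility map covers the whole table:

SELECT relpages, relallvisible FROM pg_class WHERE relname = 'tenk1';

If relallvisible comes back noticeably lower than relpages, the planner
has to assume heap fetches for the not-all-visible pages when costing an
index-only scan, and the bitmap plan can come out cheaper.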

But what I'm really confused about is that this test arrangement has
been stable since early February.  Why has wrasse suddenly started
showing a 25% failure rate when it never failed this way before that?
Somebody has to have recently committed a change that affects this.
Checking the commit log up to the onset of the failures on 8 April,
I only see two plausible candidates:

* shared-memory pgstats
* Peter's recent VACUUM changes

Any connection to pgstats is, um, pretty obscure.  I'd finger the VACUUM
changes as a more likely trigger except that the last interesting-looking
one was f3c15cbe5 on 3 April, and wrasse got through "make check" 38 times
after that before its first failure of this kind.  That doesn't square with
the 25% failure rate since then, so I'm kind of forced to the conclusion
that the pgstats work changed some behavior that it should not have.

Any ideas?

I'm tempted to add something like

SELECT relallvisible = relpages FROM pg_class WHERE relname = 'tenk1';

so that we can confirm or refute the theory that relallvisible is
the driving factor.
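In test terms that could look roughly like this, with the expected
output assuming the setup VACUUM managed to mark everything all-visible:

SELECT relallvisible = relpages FROM pg_class WHERE relname = 'tenk1';
 ?column?
----------
 t
(1 row)

A failing run would then print "f" instead, which would at least tell us
whether to keep chasing relallvisible or to look elsewhere.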

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-04-08%2003%3A48%3A30


