
From Andres Freund
Subject tests against running server occasionally fail, postgres_fdw & tenk1
Date
Msg-id 20221209001511.n3vqodxobqgscuil@awork3.anarazel.de
Hi,

Since

commit 3f0e786ccbf
Author: Andres Freund <andres@anarazel.de>
Date:   2022-12-07 12:13:35 -0800

    meson: Add 'running' test setup, as a replacement for installcheck

CI runs the pg_regress/isolationtester tests that support doing so against a
running server.


Unfortunately cfbot shows that that doesn't work entirely reliably.

The most frequent case is postgres_fdw, which somewhat regularly fails with a
regression.diffs like this:

diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
--- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out    2022-12-08 20:35:24.772888000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out    2022-12-08 20:43:38.199450000 +0000
 
@@ -9911,8 +9911,7 @@
     WHERE application_name = 'fdw_retry_check';
  pg_terminate_backend
 ----------------------
- t
-(1 row)
+(0 rows)

 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.


See e.g.
https://cirrus-ci.com/task/5925540020879360

https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log


The following comment in the test provides a hint as to what might be happening:

-- If debug_discard_caches is active, it results in
-- dropping remote connections after every transaction, making it
-- impossible to test termination meaningfully.  So turn that off
-- for this test.
SET debug_discard_caches = 0;


My guess is that a cache reset message arrives and leads to the cached
connection being terminated. Unfortunately that's hard to verify right now, as
the relevant log messages are emitted at DEBUG3 - that level is quite verbose,
so enabling it for all tests would be painful.
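
For interactive poking, the connection state on both sides can be checked with
pieces the test already relies on; a rough sketch, assuming the loopback server
and the 'fdw_retry_check' application_name from the postgres_fdw regression
test:

-- local side: is the cached connection still known, and still valid?
SELECT server_name, valid FROM postgres_fdw_get_connections() ORDER BY 1;

-- remote side: is the corresponding backend still around?
SELECT count(*) FROM pg_stat_activity
  WHERE application_name = 'fdw_retry_check';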

I *think* I have seen this failure locally at least once, when running the
test normally.


I'll try to reproduce this locally for a bit. If I can't, the only other idea
I have for debugging this is to change log_min_messages in that section of the
postgres_fdw test to DEBUG3.
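
Something along these lines, i.e. bumping logging only around the retry section
of postgres_fdw.sql (a sketch, names and placement not final):

-- make the DEBUG3 connection-discard messages show up in runningcheck.log,
-- without making the whole test run verbose
SET log_min_messages = 'debug3';
-- ... the existing fdw_retry_check statements ...
RESET log_min_messages;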



The second failure case I observed, at a lower frequency, is in the main
regression tests:
https://cirrus-ci.com/task/5640584912699392

https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/regress-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/runningcheck.log

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out    2022-12-08 16:49:28.239508000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out    2022-12-08 16:57:17.286650000 +0000
 
@@ -1910,11 +1910,15 @@
 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)
 ORDER BY unique1;
-                      QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_unique1 on tenk1
-   Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
-(2 rows)
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+   Sort Key: unique1
+   ->  Bitmap Heap Scan on tenk1
+         Recheck Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+         ->  Bitmap Index Scan on tenk1_unique1
+               Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(6 rows)

 SELECT unique1 FROM tenk1
 WHERE unique1 IN (1,42,7)


Which I think we've seen a number of times before, even in the temp-install
case. We fixed one source of this issue in this thread:
https://www.postgresql.org/message-id/1346227.1649887693%40sss.pgh.pa.us
but it looks like there's some remaining instability.
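
The plan change itself is what you'd expect if tenk1's visibility map isn't
(considered to be) fully set when the plan is built. A quick way to eyeball
that, purely as a debugging aid and not part of the test (the second query
needs the pg_visibility extension installed):

-- what the planner's index-only-scan costing looks at
SELECT relpages, relallvisible FROM pg_class WHERE relname = 'tenk1';

-- the actual visibility map state
SELECT * FROM pg_visibility_map_summary('tenk1');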

According to the server log (link above), there's no autovacuum on
tenk1.

Unfortunately we don't log non-automatic vacuums unless they are run with
verbose, so we can't see what horizon was used (see heap_vacuum_rel()'s
computation of 'instrument').
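
For reference, running something like this by hand against the test database
does show the cutoff:

-- with VERBOSE, heap_vacuum_rel() sets 'instrument', so the per-table output
-- includes the removable cutoff that was used
VACUUM (VERBOSE) tenk1;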

I don't have a better idea than to either change the above, or to revert
91998539b227dfc6dd091714da7d106f2c95a321.

Greetings,

Andres Freund


