Re: tests against running server occasionally fail, postgres_fdw & tenk1 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: tests against running server occasionally fail, postgres_fdw & tenk1
Date
Msg-id 20230207015300.7ql6omvdqjfks5tp@awork3.anarazel.de
Whole thread Raw
In response to Re: tests against running server occasionally fail, postgres_fdw & tenk1  (Andres Freund <andres@anarazel.de>)
Responses Re: tests against running server occasionally fail, postgres_fdw & tenk1  (Andres Freund <andres@anarazel.de>)
Re: tests against running server occasionally fail, postgres_fdw & tenk1  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2022-12-08 16:36:07 -0800, Andres Freund wrote:
> On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> > Unfortunately cfbot shows that that doesn't work entirely reliably.
> >
> > The most frequent case is postgres_fdw, which somewhat regularly fails with a
> > regression.diff like this:
> >
> > diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out
/tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
> > --- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out    2022-12-08 20:35:24.772888000 +0000
> > +++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out    2022-12-08
20:43:38.199450000+0000
 
> > @@ -9911,8 +9911,7 @@
> >      WHERE application_name = 'fdw_retry_check';
> >   pg_terminate_backend
> >  ----------------------
> > - t
> > -(1 row)
> > +(0 rows)
> >
> >  -- This query should detect the broken connection when starting new remote
> >  -- transaction, reestablish new connection, and then succeed.

> >
> > I guess that a cache reset message arrives and leads to the connection being
> > terminated. Unfortunately that's hard to see right now, as the relevant log
> > messages are output with DEBUG3 - it's quite verbose, so enabling it for all
> > tests will be painful.
> >
> > I *think* I have seen this failure locally at least once, when running the
> > test normally.
> >
> >
> > I'll try to reproduce this locally for a bit. If I can't, the only other idea
> > I have for debugging this is to change log_min_messages in that section of the
> > postgres_fdw test to DEBUG3.
>
> Oh. I tried to reproduce the issue, without success so far, but eventually my
> test loop got stuck in something I reported previously and forgot about since:
> https://www.postgresql.org/message-id/20220925232237.p6uskba2dw6fnwj2%40awork3.anarazel.de
>
> I wonder if the timing on the freebsd CI task works out to hitting a "smaller
> version" of the problem that eventually resolves itself, which then leads to a
> sinval reset getting sent, causing the observable problem.

The issue referenced above is now fixed, and I haven't seen instances of it
since then. I also just now fixed the issue that often lead to failing to
upload logs.

Unfortunately the fdw_retry_check issue from above has re-occurred since then:

https://cirrus-ci.com/task/4901157940756480

https://api.cirrus-ci.com/v1/artifact/task/4901157940756480/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs


Another run hit an issue we've been fighting repeatedly on the buildfarm / CI:
https://cirrus-ci.com/task/5527490404286464

https://api.cirrus-ci.com/v1/artifact/task/5527490404286464/testrun/build/testrun/regress-running/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out
/tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out    2023-02-06 23:52:43.627604000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out    2023-02-07
00:03:04.535232000+0000
 
@@ -1930,12 +1930,13 @@
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
-                      QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_thous_tenthous on tenk1
-   Index Cond: (thousand < 2)
-   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: thousand
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(4 rows)

 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)


I'd be tempted to disable the test, but it also found genuine issues in a
bunch of CF entries, and all these test instabilities seem like ones we'd also
see, with a lower occurrence on the buildfarm.


WRT the fdw_retry_check: I wonder if we should increase the log level of
a) pgfdw_inval_callback deciding to disconnect
b) ReceiveSharedInvalidMessages() deciding to reset

to DEBUG1, at least temporarily?

Alternatively we could add a
SET log_min_messages=debug4;
...
RESET log_min_messages;

around the postgres_fdw connection re-establishment test?


One thing nudging me towards the more global approach is that I have the vague
feelign that there's a wider issue around hitting more sinval resets than we
should - and it'd right now be very hard to know about that.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Time delayed LR (WAS Re: logical replication restrictions)
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: Perform streaming logical transactions by background workers and parallel apply