Re: Stabilizing the test_decoding checks, take N - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Stabilizing the test_decoding checks, take N
Date
Msg-id 20220417143104.eftldj7d73rqtqoc@alap3.anarazel.de
Whole thread Raw
In response to Stabilizing the test_decoding checks, take N  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Stabilizing the test_decoding checks, take N  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2022-04-16 13:11:59 -0400, Tom Lane wrote:
> My pet dinosaur prairiedog just failed in the contrib/test_decoding
> tests [1]:
> 
> diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out
/Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
> --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out    2022-04-15
07:59:17.000000000-0400
 
> +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out    2022-04-15
09:06:36.000000000-0400
 
> @@ -77,10 +77,12 @@
>   streaming change for transaction
>   streaming change for transaction
>   streaming change for transaction
> + closing a streamed block for transaction
> + opening a streamed block for transaction
>   streaming change for transaction
>   closing a streamed block for transaction
>   committing streamed transaction
> -(13 rows)
> +(15 rows)
> 
> Looking at the postmaster log, it's obvious where this extra transaction
> came from: auto-analyze ran on pg_type concurrently with the test step
> just before this one.  That could only happen if the tests ran long enough
> for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
> (And I hasten to point out that some other animals, such as those running
> valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)
> 
> We've seen this sort of problem before [2], and attempted to fix it [3]
> by making these tests ignore empty transactions.  But of course
> auto-analyze's transaction wasn't empty, so that didn't help.

I don't quite understand this bit - the logic test_decoding uses to
decide if a transaction is "empty" is just whether a tuple was
output. And there shouldn't be any as part of auto-analyze, because we
don't decode catalog changes.  I suspect there's something broken in the
streaming logic (potentially just in test_decoding) around
skip_empty_xacts.


> I think the most expedient way to prevent this type of failure is to run
> the test_decoding tests with autovacuum_naptime cranked up so far as to
> make it a non-issue, like maybe a day.  Since test_decoding already adds
> some custom settings to postgresql.conf, this'll take just a one-line
> addition to test_decoding/logical.conf.

I'm a bit worried about this approach - we've IIRC had past bugs that
came only to light because of autovacuum starting. I wonder if we rather
should do the opposite and reduce naptime so it'll be seen on fast
machines, rather than very slow ones.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
Next
From: Andres Freund
Date:
Subject: Re: Stabilizing the test_decoding checks, take N