Re: pgsql: Improve runtime and output of tests for replication slots checkp - Mailing list pgsql-committers

From Alexander Korotkov
Subject Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date
Msg-id CAPpHfdtnkGPMSrTaWs3vGP-FYKOXhUZq3OHDeaZwHXsAQ2g=WA@mail.gmail.com
Whole thread Raw
In response to pgsql: Improve runtime and output of tests for replication slots checkp  (Alexander Korotkov <akorotkov@postgresql.org>)
List pgsql-committers
On Mon, Jun 23, 2025 at 3:00 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> On Mon, 23 Jun 2025 at 12:24, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael@paquier.xyz> wrote:
> > > > Yeah, that's what I think too.  The unintentional omission of a
> > > > pre-shutdown delay in the 046 test has exposed some pre-existing
> > > > fragility in pg_logical_slot_get_changes().  So I'm not in favor
> > > > of changing 046 till we understand this better.
> > >
> > > Yes, better to understand what's going on before changing the test,
> > > and perhaps change 046 so as it provide stable coverage for this case,
> > > even if discovered accidentally.
> > >
> > > So, is it only pg_logical_slot_get_changes_guts() that's messed up in
> > > this context, because XLogDecodeNextRecord() is trying to read pages
> > > as it has a logic too permissive?  How did any of you reproduce the
> > > failure?  Just by running the test in a loop?  It is one of these
> > > patterns where a hardcoded sleep should do the trick and help with a
> > > bisect.
> >
> > Personally I just run the test in the loop.  It takes about ~30 times to reproduce.  Works both with -O0 and -O2.
> >
> > cd src/test/recovery
> > while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done
> >
> > > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if
> > > you expect the checkpoint to complete before sending the immediate
> > > shutdown request, I would suggest to use a wait_for_log() based on
> > > "checkpoint complete" or an equivalent loop.  What you are doing in
> > > the test is unstable as written.
> >
> > Exactly.  I've proposed the fix with wait_for_log() in [1].  Nevertheless, both cases (immediate stop before
checkpointcompletion, and immediate stop after checkpoint completion) must work without hang. 
>
> CFBot has been much more red than usual since this change afaict. Many
> more windows builds are failing than usual with an error like this:
>
> [08:28:52.683] 338/338 postgresql:recovery /
> recovery/046_checkpoint_logical_slot TIMEOUT 1000.09s exit status 1
>
> How about we revert the commit for now to get CI and the buildfarm green again?

What about removing the 046_checkpoint_logical_slot which currently
causes all the buzz?  I'm not yet convinced we need to revert
ca307d5cec90.  Opinions?

------
Regards,
Alexander Korotkov
Supabase



pgsql-committers by date:

Previous
From: John Naylor
Date:
Subject: pgsql: Properly fix AVX-512 CRC calculation bug
Next
From: Tom Lane
Date:
Subject: pgsql: Include _mm512_zextsi128_si512() in AVX-512 configure probes.