Re: Fix 035_standby_logical_decoding.pl race conditions - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Fix 035_standby_logical_decoding.pl race conditions
Date
Msg-id CAA4eK1Ku9MdGd1L74N=YKjSvB_9-2eQLsXz-LdPS1Cfnrs8TRw@mail.gmail.com
Whole thread Raw
In response to Re: Fix 035_standby_logical_decoding.pl race conditions  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Fix 035_standby_logical_decoding.pl race conditions
List pgsql-hackers
On Thu, Apr 3, 2025 at 3:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Apr 3, 2025 at 12:29 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > On Thu, Apr 03, 2025 at 05:34:10AM +0000, Hayato Kuroda (Fujitsu) wrote:
> > > Dear Bertrand, Amit,
> > >
> > > > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we should
> > > > > keep the slots active and only avoid doing the checks for them (they are
> > > > invalidated
> > > > > that's fine, they are not that's fine too).
> > > > >
> > > >
> > > > I don't mind doing that, but there is no benefit in making slots
> > > > active unless we can validate them. And we will end up adding some
> > > > more checks, as in function check_slots_conflict_reason without any
> > > > advantage.
> >
> > I think that there is advantage. The pros are:
> >
> > - the test would be closer to HEAD from a behavioural point of view
> > - it's very rare to hit the corner cases: so the test would behave the same
> > as on HEAD most of the time (and when it does not that would not hurt as the
> > checks are nor done)
> > - Kuroda-San's patch removes "or die "replication slot stats of vacuum_full_activeslot not updated"
> > while keeping the slot active is able to keep it (should the slot being invalidated
> > or not). But more on that in the comment === 1 below.
> >
> > > I feel Kuroda-San's second patch is simple, and we have
> > > > fewer chances to make mistakes and easy to maintain in the future as
> > > > well.
> >
> > Yeah maybe but the price to pay is to discard the pros above. That said, I'm also
> > fine with Kuroda-San's patch if both of you feel that it's better.
> >
> > > I have concerns for Bertrand's patch that it could introduce another timing
> > > issue. E.g., if the activated slots are not invalidated, dropping slots is keep
> > > being activated so the dropping might be fail.
> >
> > Yeah, but the drop is done with "$node_standby->psql" so that the test does not
> > produce an error. It would produce an error should we use "$node_standby->safe_psql"
> > instead.
> >
> > > I did not reproduce this but
> > > something like this can happen if we activate slots.
> >
> > You can see it that way (+ reproducer.txt):
> >
> > "
> > +       my $bdt = $node_standby->safe_psql('postgres', qq[SELECT * from pg_replication_slots]);
> > +       note "BDT: $bdt";
> > +
> >         $node_standby->psql('postgres',
> >                 qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
> > "
> >
> > You'd see the slot being active and the "$node_standby->psql" not reporting
> > any error.
> >
>
> Hmm, but adding some additional smarts also makes this test less easy
> to backpatch. I see your points related to the benefits, but I still
> mildly prefer to go with the lesser changes approach for backbranches
> patch. Normally, we don't enhance backbranches code without making
> equivalent changes in HEAD, so adding some new bugs only in
> backbranches has a lesser chance.
>

Bertrand, do you agree with the fewer changes approach (where active
slots won't be tested) for backbranches? I think now that we have
established that the vacuum full test is also prone to failure due to
race condition in the test, this is the only remaining open point.

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: BUG #18815: Logical replication worker Segmentation fault
Next
From: Etsuro Fujita
Date:
Subject: Re: Typo in comment for pgstat_database_flush_cb()