Re: Fix 035_standby_logical_decoding.pl race conditions - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Fix 035_standby_logical_decoding.pl race conditions
Date
Msg-id Z+4x4HM9ZWlD0D8P@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to RE: Fix 035_standby_logical_decoding.pl race conditions  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses Re: Fix 035_standby_logical_decoding.pl race conditions
List pgsql-hackers
Hi,

On Thu, Apr 03, 2025 at 05:34:10AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Dear Bertrand, Amit,
> 
> > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we should
> > > keep the slots active and only avoid doing the checks for them (they are
> > invalidated
> > > that's fine, they are not that's fine too).
> > >
> > 
> > I don't mind doing that, but there is no benefit in making slots
> > active unless we can validate them. And we will end up adding some
> > more checks, as in function check_slots_conflict_reason without any
> > advantage.

I think that there is advantage. The pros are:

- the test would be closer to HEAD from a behavioural point of view
- it's very rare to hit the corner cases: so the test would behave the same
as on HEAD most of the time (and when it does not that would not hurt as the
checks are nor done)
- Kuroda-San's patch removes "or die "replication slot stats of vacuum_full_activeslot not updated"
while keeping the slot active is able to keep it (should the slot being invalidated
or not). But more on that in the comment === 1 below.

> I feel Kuroda-San's second patch is simple, and we have
> > fewer chances to make mistakes and easy to maintain in the future as
> > well.

Yeah maybe but the price to pay is to discard the pros above. That said, I'm also
fine with Kuroda-San's patch if both of you feel that it's better.

> I have concerns for Bertrand's patch that it could introduce another timing
> issue. E.g., if the activated slots are not invalidated, dropping slots is keep
> being activated so the dropping might be fail.

Yeah, but the drop is done with "$node_standby->psql" so that the test does not
produce an error. It would produce an error should we use "$node_standby->safe_psql"
instead.

> I did not reproduce this but
> something like this can happen if we activate slots.

You can see it that way (+ reproducer.txt):

"
+       my $bdt = $node_standby->safe_psql('postgres', qq[SELECT * from pg_replication_slots]);
+       note "BDT: $bdt";
+
        $node_standby->psql('postgres',
                qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
"

You'd see the slot being active and the "$node_standby->psql" not reporting
any error.

> Attached patch has a conclusion of these discussions, slots are created but
> it seldomly be activated.

Thanks for the patch!

=== 1

-$node_standby->poll_query_until('testdb',
-       qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(

I wonder if we could not keep this test and make the slot active for the
vacuum full case. Looking at drongo's failure in [1], there is no occurence
of "vacuum full" and that's probably linked to Andres's explanation in [2]:

"
a VACUUM FULL on pg_class is
used, which prevents logical decoding from progressing after it started (due
to the logged AEL at the start of VACFULL).
"

meaning that the active slot is invalidated even if the catalog xmin's moves
forward due to xl_running_xacts.

[1]: https://www.postgresql.org/message-id/386386.1737736935%40sss.pgh.pa.us
[2]: https://www.postgresql.org/message-id/zqypkuvtihtd2zbmwdfmcceujg4fuakrhojmjkxpp7t4udqkty%40couhenc7dsor

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: pg_recvlogical cannot create slots with failover=true
Next
From: Jakub Wartak
Date:
Subject: Re: Draft for basic NUMA observability