Re: Fix 035_standby_logical_decoding.pl race conditions - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Fix 035_standby_logical_decoding.pl race conditions |
Date | |
Msg-id | Z+4x4HM9ZWlD0D8P@ip-10-97-1-34.eu-west-3.compute.internal Whole thread Raw |
In response to | RE: Fix 035_standby_logical_decoding.pl race conditions ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
Responses |
Re: Fix 035_standby_logical_decoding.pl race conditions
|
List | pgsql-hackers |
Hi, On Thu, Apr 03, 2025 at 05:34:10AM +0000, Hayato Kuroda (Fujitsu) wrote: > Dear Bertrand, Amit, > > > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we should > > > keep the slots active and only avoid doing the checks for them (they are > > invalidated > > > that's fine, they are not that's fine too). > > > > > > > I don't mind doing that, but there is no benefit in making slots > > active unless we can validate them. And we will end up adding some > > more checks, as in function check_slots_conflict_reason without any > > advantage. I think that there is advantage. The pros are: - the test would be closer to HEAD from a behavioural point of view - it's very rare to hit the corner cases: so the test would behave the same as on HEAD most of the time (and when it does not that would not hurt as the checks are nor done) - Kuroda-San's patch removes "or die "replication slot stats of vacuum_full_activeslot not updated" while keeping the slot active is able to keep it (should the slot being invalidated or not). But more on that in the comment === 1 below. > I feel Kuroda-San's second patch is simple, and we have > > fewer chances to make mistakes and easy to maintain in the future as > > well. Yeah maybe but the price to pay is to discard the pros above. That said, I'm also fine with Kuroda-San's patch if both of you feel that it's better. > I have concerns for Bertrand's patch that it could introduce another timing > issue. E.g., if the activated slots are not invalidated, dropping slots is keep > being activated so the dropping might be fail. Yeah, but the drop is done with "$node_standby->psql" so that the test does not produce an error. It would produce an error should we use "$node_standby->safe_psql" instead. > I did not reproduce this but > something like this can happen if we activate slots. You can see it that way (+ reproducer.txt): " + my $bdt = $node_standby->safe_psql('postgres', qq[SELECT * from pg_replication_slots]); + note "BDT: $bdt"; + $node_standby->psql('postgres', qq[SELECT pg_drop_replication_slot('$inactive_slot')]); " You'd see the slot being active and the "$node_standby->psql" not reporting any error. > Attached patch has a conclusion of these discussions, slots are created but > it seldomly be activated. Thanks for the patch! === 1 -$node_standby->poll_query_until('testdb', - qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot'] -) or die "replication slot stats of vacuum_full_activeslot not updated"; - # This should trigger the conflict wait_until_vacuum_can_remove( I wonder if we could not keep this test and make the slot active for the vacuum full case. Looking at drongo's failure in [1], there is no occurence of "vacuum full" and that's probably linked to Andres's explanation in [2]: " a VACUUM FULL on pg_class is used, which prevents logical decoding from progressing after it started (due to the logged AEL at the start of VACFULL). " meaning that the active slot is invalidated even if the catalog xmin's moves forward due to xl_running_xacts. [1]: https://www.postgresql.org/message-id/386386.1737736935%40sss.pgh.pa.us [2]: https://www.postgresql.org/message-id/zqypkuvtihtd2zbmwdfmcceujg4fuakrhojmjkxpp7t4udqkty%40couhenc7dsor Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: