Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed |
Date | |
Msg-id | Zao6xfSQUry5Fkdn@ip-10-97-1-34.eu-west-3.compute.internal Whole thread Raw |
In response to | Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed (Alexander Lakhin <exclusion@gmail.com>) |
Responses |
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
|
List | pgsql-hackers |
Hi, On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote: > Hello, > > 15.01.2024 12:00, Alexander Lakhin wrote: > > If this approach looks promising to you, maybe we could add a submodule to > > perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in > > 019_replslot_limit) as well. > > > > Personally I think that having such a functionality for using in tests > > might be useful not only to avoid some "problematic" behaviour but also to > > test the opposite cases. > > After spending a few days on it, I've discovered two more issues: > https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com > https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com > > (The latter is not related to bgwriter directly, but it was discovered > thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.) > > So it becomes clear that the 035 test is not the only one, which might > suffer from bgwriter's activity, Yeah... thanks for sharing! > and inventing a way to stop bgwriter/ > control it is a different subject, getting out of scope of the current > issue. Agree. > 15.01.2024 11:49, Bertrand Drouvot wrote: > > We did a few things in this thread, so to sum up what we've discovered: > > > > - a race condition in InvalidatePossiblyObsoleteSlot() (see [1]) > > - we need to launch the vacuum(s) only if we are sure we got a newer XID horizon > > ( proposal in in v6 attached) > > - we need a way to control how frequent xl_running_xacts are emmitted (to ensure > > they are not triggered in a middle of an active slot invalidation test). > > > > I'm not sure it's possible to address Tom's concern and keep the test "predictable". > > > > So, I think I'd vote for Michael's proposal to implement a superuser-settable > > developer GUC (as sending a SIGSTOP on the bgwriter (and bypass $windows_os) would > > still not address Tom's concern anyway). > > > > Another option would be to "sacrifice" the full predictablity of the test (in > > favor of real-world behavior testing)? > > > > [1]: https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal > > So, now we have the test 035 failing due to nondeterministic vacuum > activity in the first place, and due to bgwriter's activity in the second. Yeah, that's also my understanding. > Maybe it would be a right move to commit the fix, and then think about > more rare failures. +1 > Though I have a couple of question regarding the fix left, if you don't > mind: > 1) The test has minor defects in the comments, that I noted before [1]; > would you like to fix them in passing? > > > BTW, it looks like the comment: > > # One way to produce recovery conflict is to create/drop a relation and > > # launch a vacuum on pg_class with hot_standby_feedback turned off on the standby. > > in scenario 3 is a copy-paste error. Nice catch, thanks! Fixed in v7 attached. > > Also, there are two "Scenario 4" in this test. D'oh! Fixed in v7. > > > > 2) Shall we align the 035_standby_logical_decoding with > 031_recovery_conflict in regard to improving stability of vacuum? Yeah, I think that could make sense. > I see the following options for this: > a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests; > b) use FREEZE and autovacuum = off in both tests; > c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and > autovacuum = off in both. > I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not sure using FREEZE would give full "stabilization predictability", at least for 035_standby_logical_decoding.pl). That said I did not test what the outcome would be for 031_recovery_conflict.pl by making use of a). > I've re-tested the v6 patch today and confirmed that it makes the test > more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in > parallel and got failures ('inactiveslot slot invalidation is logged with > vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied. > (With unlimited CPU, when average test duration is around 70 seconds.) > > But with v6 applied, 60 iterations succeeded. Nice! Thanks for the testing! Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: