Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Date
Msg-id ZaDnIMTbT+0yR5LZ@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
List pgsql-hackers
Hi,

On Fri, Jan 12, 2024 at 07:01:55AM +0900, Michael Paquier wrote:
> On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:
> > Bertrand, I've relaunched tests in the same slowed down VM with both
> > patches applied (but with no other modifications) and got a failure
> > with pg_class, similar to what we had seen before:
> > 9       #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> > 9       #   at t/035_standby_logical_decoding.pl line 230.
> > 
> > Please look at the logs attached (I see there Standby/RUNNING_XACTS near
> > 'invalidating obsolete replication slot "row_removal_inactiveslot"').

Thanks! 

For this one, the "good" news is that it looks like that we don’t see the
"terminating" message not followed by an "obsolete" message (so the engine
behaves correctly) anymore.

There is simply nothing related to the row_removal_activeslot at all (the catalog_xmin
advanced and there is no conflict).

And I agree that this is due to the Standby/RUNNING_XACTS that is "advancing" the
catalog_xmin of the active slot.

> Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
> = minimal, because these lead to unpredictible records inserted,
> impacting the reliability of the tests.  We cannot do that here,
> obviously.  That may be a long shot, but could it be possible to tweak
> the test with a retry logic, retrying things if such a standby
> snapshot is found because we know that the invalidation is not going
> to work anyway?

I think it all depends what the xl_running_xacts does contain (means does it
"advance" or not the catalog_xmin in our case).

In our case it does advance it (should it occurs) due to the "select txid_current()"
that is done in wait_until_vacuum_can_remove() in 035_standby_logical_decoding.pl.

I suggest to make use of txid_current_snapshot() instead (that does not produce
a Transaction/COMMIT wal record, as opposed to txid_current()).

I think that it could be "enough" for our case here, and it's what v5 attached is
now doing.

Let's give v5 a try? (please apply v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
too).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Re: Synchronizing slots from primary to standby
Next
From: Masahiko Sawada
Date:
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum