Re: Fix race condition in InvalidatePossiblyObsoleteSlot() - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Date
Msg-id Zeir1JpVsfdb7/nb@ip-10-97-1-34.eu-west-3.compute.internal
In response to Re: Fix race condition in InvalidatePossiblyObsoleteSlot()  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
List pgsql-hackers
Hi,

On Wed, Mar 06, 2024 at 05:45:56PM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 6, 2024 at 4:51 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Mar 06, 2024 at 09:17:58AM +0000, Bertrand Drouvot wrote:
> > > Right, somehow out of context here.
> >
> > We're not yet in the green yet, one of my animals has complained:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2024-03-06%2010%3A10%3A03
> >
> > This is telling us that the xmin horizon is unchanged, and the test
> > cannot move on with the injection point wake up that would trigger the
> > following logs:
> > 2024-03-06 20:12:59.039 JST [21143] LOG:  invalidating obsolete replication slot "injection_activeslot"
> > 2024-03-06 20:12:59.039 JST [21143] DETAIL:  The slot conflicted with xid horizon 770.
> >
> > Not sure what to think about that yet.
> 
> Windows - Server 2019, VS 2019 - Meson & ninja on my CI setup isn't
> happy about that as well [1]. It looks like the slot's catalog_xmin on
> the standby isn't moving forward.
> 

Thank you both for the report! I ran a few tests manually and can see the issue
from time to time. When the issue occurs, the logical decoding was able to
get past the place where LogicalConfirmReceivedLocation() updates the
slot's catalog_xmin before being killed. In that case I can see that the
catalog_xmin has been updated to the xid horizon.

This means that in a failed test we have something like:

slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839." 

Whereas when the test passes, you'll see something like:

slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

In the failing test, the call to SELECT pg_logical_slot_get_changes() no
longer advances the slot's catalog_xmin.
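
To make the two cases above concrete, here is a minimal Python sketch of a
wraparound-aware "precedes or equals" comparison on 32-bit xids. It is only a
simplified model: it assumes the horizon conflict check compares the slot's
catalog_xmin against the conflict horizon in this way, it handles normal
(non-special) xids only, and the function names are hypothetical, not the
server's.

```python
def xid_precedes_or_equals(a: int, b: int) -> bool:
    """Wraparound-aware 'a <= b' for 32-bit xids (normal xids only)."""
    diff = (a - b) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit difference as a signed value.
    if diff >= 0x80000000:
        diff -= 0x100000000
    return diff <= 0


def slot_conflicts(catalog_xmin: int, horizon: int) -> bool:
    """Hypothetical helper: does the slot conflict with the xid horizon?"""
    return xid_precedes_or_equals(catalog_xmin, horizon)


# The two cases quoted above.
failing_case = slot_conflicts(839, 839)  # catalog_xmin already at the horizon
passing_case = slot_conflicts(841, 842)  # catalog_xmin still behind the horizon
```

Under this model the slot conflicts with the horizon in both cases; what
differs is that in the failing case the catalog_xmin has already reached the
horizon before the test calls pg_logical_slot_get_changes().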

To fix this, I think we need a new transaction from the primary to decode before
executing pg_logical_slot_get_changes(). But this transaction has to be replayed
on the standby first by the startup process, which means we need to wake up
"terminate-process-holding-slot" and that we probably need another injection
point somewhere in this test.

I'll look at it, unless you have another idea?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


