Re: Fix 035_standby_logical_decoding.pl race conditions - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Fix 035_standby_logical_decoding.pl race conditions
Msg-id Z+z3Hs765Ixo0mN7@ip-10-97-1-34.eu-west-3.compute.internal
In response to RE: Fix 035_standby_logical_decoding.pl race conditions  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-hackers

Hi Kuroda-san,

On Wed, Apr 02, 2025 at 07:16:25AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Dear Amit, Bertrand,
> 
> > You have not added any injection point for the above case. Isn't it
> > possible that if running_xact record is logged concurrently to the
> > pruning record, it should move the active slot on standby, and the
> > same failure should occur in this case as well?
> 
> I considered that the timing failure can happen. Reproducer:
> 
> ```
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
> +$node_primary->safe_psql('testdb', 'CHECKPOINT');
> +sleep(20);
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
> ```

Yeah, I was going to provide the exact same reproducer and then saw your email.

> Based on the fact, I've updated to use injection_points for scenario 5. Of course,
> PG16/17 patches won't use the active slot for that scenario.

Thanks for the updated patch!

As for v4-0001:

=== 1

+# would advance an active replication slot's catalog_xmin

s/would/could/? I mean the system also needs to be "slow" enough for that to
happen (hence the sleep() in the reproducer).

=== 2

+# Injection_point is used to avoid seeing an xl_running_xacts even here. In
+# scenario 5, we verify the case that the backend process detects the page has
+# enough tuples; thus, page pruning happens. If the record is generated just
+# before doing on-pruning, the catalog_xmin of the active slot would be
+# updated; hence, the conflict would not occur.

Not sure we need to explain what scenario 5 does here, but that does not hurt
if you feel the need.

Also, maybe mention the last update in the comment and add some nuance (as
proposed in === 1), something like this?

"
# Injection_point is used to avoid seeing an xl_running_xacts record here.
# Indeed, if one is generated between the last 2 updates then the catalog_xmin
# of the active slot could be updated; hence, the conflict might not occur.
"

Apart from that, the tests look good to me and all the problematic scenarios
are covered.

As for PG17-v4-0001:

=== 3

-# seeing a xl_running_xacts that would advance an active replication slot's
+# seeing the xl_running_xacts that would advance an active replication slot's

Why this change?

=== 4

It looks like check_slots_conflict_reason() is not called with
checks_active_slot as an argument.

=== 5

I think that we could remove the need for the drop_active_slot parameter in
drop_logical_slots() and just check if an active slot exists (and if so drop
it). That said, I'm not sure it's worth going that far for backpatching.
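
To illustrate the idea, here is a rough sketch only, not what the patch does.
It assumes the slots are named $slot_prefix . 'inactiveslot' and
$slot_prefix . 'activeslot', and that the node is passed in as an argument
and provides safe_psql() as PostgreSQL::Test::Cluster does. The explicit
$node argument and the loop are hypothetical choices of mine:

```perl
use strict;
use warnings;

# Hypothetical variant of drop_logical_slots() without a drop_active_slot
# parameter: each slot is dropped only if pg_replication_slots lists it.
# Assumption: $node provides safe_psql(dbname, sql) and returns the query
# output as a string, as PostgreSQL::Test::Cluster does.
sub drop_logical_slots
{
	my ($node, $slot_prefix) = @_;

	foreach my $suffix ('inactiveslot', 'activeslot')
	{
		my $slot = $slot_prefix . $suffix;

		# count(*) comes back as "0" or "1" for a given slot name.
		my $exists = $node->safe_psql('postgres',
			qq[SELECT count(*) FROM pg_replication_slots WHERE slot_name = '$slot']);
		next if $exists == 0;

		$node->safe_psql('postgres',
			qq[SELECT pg_drop_replication_slot('$slot')]);
	}
	return;
}
```

A caller would then use drop_logical_slots($node_standby, $slot_prefix)
in every scenario, whether or not the active slot was created for it.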

As far PG16-v4:

=== 6

Same as === 3 and === 5 (=== 4 does not apply as check_slots_conflict_reason()
does not exist).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


