On Wed, Feb 14, 2024 at 03:31:16PM +0000, Bertrand Drouvot wrote:
> On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> > The 035_standby_logical_decoding.pl hang is
> > a race condition arising from an event sequence like this:
> >
> > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> > - Test script calls pg_log_standby_snapshot() on primary. Emits XLOG_RUNNING_XACTS.
> > - checkpoint_timeout makes a primary checkpoint finish. Emits XLOG_RUNNING_XACTS.
> > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS.
> > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> >
> > Other test code already has a solution for this, so the attached patches add a
> > timeout and copy the existing solution. I'm also attaching the hack that
> > makes it 100% reproducible.
>
> I did a few tests and confirm that the proposed solution fixes the corner case.

Thanks for reviewing.

> What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
> and then make use of it in create_logical_slot_on_standby() and above? (something
> like wait_for_restart_lsn_calculation-v1.patch attached).

Waiting for restart_lsn is just a prerequisite for calling
pg_log_standby_snapshot(), so I wouldn't separate those two.  If we're
extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
and make the API like one of these:

  $standby->wait_for_subscription_starting_point($primary, $slot_name);
  $primary->log_standby_snapshot($standby, $slot_name);
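
For concreteness, the second form could look roughly like the sketch below.
It is untested against a real cluster and only illustrates the shape:
log_standby_snapshot() is the hypothetical sub name from above, and the query
is an assumption about how the restart_lsn check would be written.
poll_query_until() and safe_psql() are the existing Cluster.pm methods.

```perl
# Hypothetical addition to PostgreSQL::Test::Cluster; a sketch, not a patch.
package PostgreSQL::Test::Cluster;

use strict;
use warnings;

# Called on the primary node.  $standby is the node that is creating the
# logical slot named $slot_name.
sub log_standby_snapshot
{
	my ($self, $standby, $slot_name) = @_;

	# Wait until the standby has computed the slot's restart_lsn.  Only an
	# xl_running_xacts record emitted after that point is useful to it.
	$standby->poll_query_until(
		'postgres', qq[
		SELECT restart_lsn IS NOT NULL
		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
	])
	  or die
	  "timed out waiting for logical slot to calculate its restart_lsn";

	# Now emit the record the standby is waiting for.
	$self->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');

	return;
}
```

Keeping both steps in one sub means no caller can log the snapshot before the
restart_lsn wait, which is the ordering the race depends on.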

Would you like to finish the patch in such a way?