Thread: sync_standbys_defined read/write race on startup

sync_standbys_defined read/write race on startup

From
"Maksim.Melnikov"
Date:

Hi,
On my windows machine I've found unstable test, it fails very rarely, but anyway, sometimes it fails.

src/test/recovery/t/009_twophase.pl

#   Failed test 'Committed prepared transaction is visible to new snapshot in standby'
#   at t/009_twophase.pl line 360.
#          got: '0'
#     expected: '2'
# Looks like you failed 1 test of 30.
t/009_twophase.pl ....................
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/30 subtests

cat src/test/recovery/t/009_twophase.pl

....
$cur_primary->psql(
    'postgres', "
    SET synchronous_commit='remote_apply'; -- To ensure the standby is caught up
    CREATE TABLE t_009_tbl_standby_mvcc (id int, msg text);
    BEGIN;
    INSERT INTO t_009_tbl_standby_mvcc VALUES (1, 'issued to ${cur_primary_name}');
    SAVEPOINT s1;
    INSERT INTO t_009_tbl_standby_mvcc VALUES (2, 'issued to ${cur_primary_name}');
    PREPARE TRANSACTION 'xact_009_standby_mvcc';
    ");
$cur_primary->stop;
$cur_standby->restart;

# Acquire a snapshot in standby, before we commit the prepared transaction
my $standby_session =
  $cur_standby->background_psql('postgres', on_error_die => 1);
$standby_session->query_safe("BEGIN ISOLATION LEVEL REPEATABLE READ");
$psql_out =
  $standby_session->query_safe("SELECT count(*) FROM t_009_tbl_standby_mvcc");
is($psql_out, '0',
    "Prepared transaction not visible in standby before commit");

# Commit the transaction in primary
$cur_primary->start;
$cur_primary->psql(
    'postgres', "
SET synchronous_commit='remote_apply'; -- To ensure the standby is caught up
COMMIT PREPARED 'xact_009_standby_mvcc';
");

# Still not visible to the old snapshot
$psql_out =
  $standby_session->query_safe("SELECT count(*) FROM t_009_tbl_standby_mvcc");
is($psql_out, '0',
    "Committed prepared transaction not visible to old snapshot in standby");

# Is visible to a new snapshot
$standby_session->query_safe("COMMIT");
$psql_out =
  $standby_session->query_safe("SELECT count(*) FROM t_009_tbl_standby_mvcc");
is($psql_out, '2',
    "Committed prepared transaction is visible to new snapshot in standby");

$standby_session->quit;
.....



Test check visibility of prepared transactions in standby after a restart while primary is down.
Failed assert check that changes, commited on transaction on primary node, were synchronously replicated to standby node.
In test flow we have primary server stop/start events.
Assert failure is hard to reproduce, but I found how to do it synthetically, can do it by injecting in code synthetical sleep call

--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 
        if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
        {
+               sleep(2);
                LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
                /*
                

The root cause is WalSndCtl->sync_standbys_defined, it is located in shared memory and can be updated by checkpointer process only,
in backends it can be read in case of sync replication. So we have race on postgres startup between backend, processing commit request, and checkpointer,
setting WalSndCtl->sync_standbys_defined.

Synchronous replication occurred in SyncRepWaitForLSN(syncrep.c)

void
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
{
.....

    /*
     * Fast exit if user has not requested sync replication, or there are no
     * sync replication standby names defined.
     *
     * Since this routine gets called every commit time, it's important to
     * exit quickly if sync replication is not requested. So we check
     * WalSndCtl->sync_standbys_defined flag without the lock and exit
     * immediately if it's false. If it's true, we need to check it again
     * later while holding the lock, to check the flag and operate the sync
     * rep queue atomically. This is necessary to avoid the race condition
     * described in SyncRepUpdateSyncStandbysDefined(). On the other hand, if
     * it's false, the lock is not necessary because we don't touch the queue.
     */
    if (!SyncRepRequested() ||
        !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
        return;

.....

    if (!WalSndCtl->sync_standbys_defined ||
        lsn <= WalSndCtl->lsn[mode])
    {
        LWLockRelease(SyncRepLock);
        return;
    }
    
...........


Checkpointer set WalSndCtl->sync_standbys_defined parameter in src/backend/replication/syncrep.c,
SyncRepUpdateSyncStandbysDefined method

void
SyncRepUpdateSyncStandbysDefined(void)
{
    bool        sync_standbys_defined = SyncStandbysDefined();

    if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
    {
        LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
..................
        WalSndCtl->sync_standbys_defined = sync_standbys_defined;

        LWLockRelease(SyncRepLock);
    }
}

So my question is
can we accept such behavior as bug of postgres codebase, or, on other way, it is bug of unstable test?
P.S.
For me it is looking strange, that postgres start processing workload before full config initializing.
Attached possible patch for codebase.

Best regards
Melnikov Maksim

Attachment

Re: sync_standbys_defined read/write race on startup

From
Kirill Reshke
Date:
On Wed, 26 Mar 2025 at 21:18, Maksim.Melnikov <m.melnikov@postgrespro.ru> wrote:
>

> So my question is
> can we accept such behavior as bug of postgres codebase, or, on other way, it is bug of unstable test?
> P.S.
> For me it is looking strange, that postgres start processing workload before full config initializing.
> Attached possible patch for codebase.
>
> Best regards
> Melnikov Maksim

We do not tolerate TAP test flaps. Your repro is OK, can we have
deterministic test here using injection point technique?



-- 
Best regards,
Kirill Reshke



Re: sync_standbys_defined read/write race on startup

From
Michael Paquier
Date:
On Wed, Mar 26, 2025 at 05:17:17PM +0300, Maksim.Melnikov wrote:
> Test check visibility of prepared transactions in standby after a restart
> while primary is down.
> Failed assert check that changes, commited on transaction on primary node,
> were synchronously replicated to standby node.
> In test flow we have primary server stop/start events.
> Assert failure is hard to reproduce, but I found how to do it synthetically,
> can do it by injecting in code synthetical sleep call
>
> --- a/src/backend/replication/syncrep.c
> +++ b/src/backend/replication/syncrep.c
> @@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void)
>
>         if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
>         {
> +               sleep(2);
>                 LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
>
>                 /*

Nice investigation to find out that this specific path faces a race
here.  Tests on Windows are usually slower and good at catching these.
It may be possible that the buildfarm has reported that, actually.

> So my question is
> can we accept such behavior as bug of postgres codebase, or, on other way,
> it is bug of unstable test?
> P.S.
> For me it is looking strange, that postgres start processing workload before
> full config initializing.
> Attached possible patch for codebase.

Yeah, I agree that it is a bug, and your approach to be more careful
in SyncRepWaitForLSN() so as we wait until the configuration is loaded
and set from the checkpointer sounds like the sensible thing to do,
because the backends have no idea yet what they should do as the
configuration is not loaded.

Tracking if the configuration has been properly loaded in
WalSndCtlData makes sense here, but I am confused by the patch: you
have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets
it.  It may be simpler to use a separate boolean flag for this
purpose.  WalSndCtlData is a private structure holding the state of
the WAL senders internally, so ABI does not stress me much here
(flexible array at the end of the structure, as well..).

Also mentioned by Kirill downthread: let's add a regression test with
an injection point that does a wait.  This race condition is worth
having a test for.  You could define the point of where you have added
your hardcoded sleep(), for example.
--
Michael

Attachment

Re: sync_standbys_defined read/write race on startup

From
"Maksim.Melnikov"
Date:


On 27.03.2025 01:13, Michael Paquier wrote:
Tracking if the configuration has been properly loaded in
WalSndCtlData makes sense here, but I am confused by the patch: you
have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets
it.  It may be simpler to use a separate boolean flag for this
purpose.  WalSndCtlData is a private structure holding the state of
the WAL senders internally, so ABI does not stress me much here
(flexible array at the end of the structure, as well..).

Thanks for your responses.
Yes, I agree that it looks more complicated then addition new bool field,
but there is place in SyncRepWaitForLSN method, where we check
WalSndCtlData->sync_standbys_defined out of acquire/release block.

void
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
{
    int            mode;
...............................................
    if (!SyncRepRequested() ||
        !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
        return;

..........................

If we just add new bool field, we can't atomically check sync_standbys_defined flag
and new bool field, only in lock, so we lose fast exit optimization. But I agree
that in patch I lost SYNCSTANDBYSDEFINED setting, thanks for attentiveness,
so I've attached new patch with fix,also I've done bits setting more evident. 

Anyway, if you still thinks that my arguments not enough, please notify, I will do patch with separate flag.

On 27.03.2025 01:13, Michael Paquier wrote:
Also mentioned by Kirill downthread: let's add a regression test with
an injection point that does a wait.  This race condition is worth
having a test for.  You could define the point of where you have added
your hardcoded sleep(), for example.
Sorry, but for me it seems that we can't use injection points toolset here, because we are talking
about startup process, and we can't call injection_points_attach on client backend before checkpointer.

Anyway, if you know how to do it, please share details, I will do.


Best regards, Maksim Melnikov

Attachment