Thread: sync_standbys_defined read/write race on startup

sync_standbys_defined read/write race on startup

From

"Maksim.Melnikov"

Date:

26 March, 17:17:17

Hi,
On my windows machine I've found unstable test, it fails very rarely, but anyway, sometimes it fails.

src/test/recovery/t/009_twophase.pl

#   Failed test 'Committed prepared transaction is visible to new snapshot in standby'
#   at t/009_twophase.pl line 360.
#          got: '0'
#     expected: '2'
# Looks like you failed 1 test of 30.
t/009_twophase.pl ....................
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/30 subtests

cat src/test/recovery/t/009_twophase.pl

....
$cur_primary->psql(
    'postgres', "
    SET synchronous_commit='remote_apply'; -- To ensure the standby is caught up
    CREATE TABLE t_009_tbl_standby_mvcc (id int, msg text);
    BEGIN;
    INSERT INTO t_009_tbl_standby_mvcc VALUES (1, 'issued to ${cur_primary_name}');
    SAVEPOINT s1;
    INSERT INTO t_009_tbl_standby_mvcc VALUES (2, 'issued to ${cur_primary_name}');
    PREPARE TRANSACTION 'xact_009_standby_mvcc';
    ");
$cur_primary->stop;
$cur_standby->restart;

# Acquire a snapshot in standby, before we commit the prepared transaction
my $standby_session =
$cur_standby->background_psql('postgres', on_error_die => 1);
$standby_session->query_safe("BEGIN ISOLATION LEVEL REPEATABLE READ");
$psql_out =
$standby_session->query_safe("SELECT count(*) FROM t_009_tbl_standby_mvcc");
is($psql_out, '0',
    "Prepared transaction not visible in standby before commit");

# Commit the transaction in primary
$cur_primary->start;
$cur_primary->psql(
    'postgres', "
SET synchronous_commit='remote_apply'; -- To ensure the standby is caught up
COMMIT PREPARED 'xact_009_standby_mvcc';
");

# Still not visible to the old snapshot
$psql_out =
$standby_session->query_safe("SELECT count(*) FROM t_009_tbl_standby_mvcc");
is($psql_out, '0',
    "Committed prepared transaction not visible to old snapshot in standby");

# Is visible to a new snapshot
$standby_session->query_safe("COMMIT");
$psql_out =
$standby_session->query_safe("SELECT count(*) FROM t_009_tbl_standby_mvcc");
is($psql_out, '2',
    "Committed prepared transaction is visible to new snapshot in standby");
$standby_session->quit;
.....

Test check visibility of prepared transactions in standby after a restart while primary is down.
Failed assert check that changes, commited on transaction on primary node, were synchronously replicated to standby node.
In test flow we have primary server stop/start events.
Assert failure is hard to reproduce, but I found how to do it synthetically, can do it by injecting in code synthetical sleep call

--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void)

        if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
        {
+               sleep(2);
                LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);

                /*


The root cause is WalSndCtl->sync_standbys_defined, it is located in shared memory and can be updated by checkpointer process only,
in backends it can be read in case of sync replication. So we have race on postgres startup between backend, processing commit request, and checkpointer,
setting WalSndCtl->sync_standbys_defined.

Synchronous replication occurred in SyncRepWaitForLSN(syncrep.c)

void
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
{
.....

    /*
     * Fast exit if user has not requested sync replication, or there are no
     * sync replication standby names defined.
     *
     * Since this routine gets called every commit time, it's important to
     * exit quickly if sync replication is not requested. So we check
     * WalSndCtl->sync_standbys_defined flag without the lock and exit
     * immediately if it's false. If it's true, we need to check it again
     * later while holding the lock, to check the flag and operate the sync
     * rep queue atomically. This is necessary to avoid the race condition
     * described in SyncRepUpdateSyncStandbysDefined(). On the other hand, if
     * it's false, the lock is not necessary because we don't touch the queue.
     */
    if (!SyncRepRequested() ||
        !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
        return;

.....

    if (!WalSndCtl->sync_standbys_defined ||
        lsn <= WalSndCtl->lsn[mode])
    {
        LWLockRelease(SyncRepLock);
        return;
    }

...........

Checkpointer set WalSndCtl->sync_standbys_defined parameter in src/backend/replication/syncrep.c,
SyncRepUpdateSyncStandbysDefined method

void
SyncRepUpdateSyncStandbysDefined(void)
{
    bool        sync_standbys_defined = SyncStandbysDefined();

    if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
    {
        LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
..................
        WalSndCtl->sync_standbys_defined = sync_standbys_defined;

        LWLockRelease(SyncRepLock);
    }
}

So my question is
can we accept such behavior as bug of postgres codebase, or, on other way, it is bug of unstable test?
P.S.
For me it is looking strange, that postgres start processing workload before full config initializing.
Attached possible patch for codebase.

Best regards
Melnikov Maksim

Attachment

0001-Fix-sync_standbys_defined-race-on-startup.patch

Re: sync_standbys_defined read/write race on startup

From

Kirill Reshke

Date:

26 March, 22:48:54

On Wed, 26 Mar 2025 at 21:18, Maksim.Melnikov <m.melnikov@postgrespro.ru> wrote:
>

> So my question is
> can we accept such behavior as bug of postgres codebase, or, on other way, it is bug of unstable test?
> P.S.
> For me it is looking strange, that postgres start processing workload before full config initializing.
> Attached possible patch for codebase.
>
> Best regards
> Melnikov Maksim

We do not tolerate TAP test flaps. Your repro is OK, can we have
deterministic test here using injection point technique?



-- 
Best regards,
Kirill Reshke

Re: sync_standbys_defined read/write race on startup

From

Michael Paquier

Date:

27 March, 01:13:08

On Wed, Mar 26, 2025 at 05:17:17PM +0300, Maksim.Melnikov wrote:
> Test check visibility of prepared transactions in standby after a restart
> while primary is down.
> Failed assert check that changes, commited on transaction on primary node,
> were synchronously replicated to standby node.
> In test flow we have primary server stop/start events.
> Assert failure is hard to reproduce, but I found how to do it synthetically,
> can do it by injecting in code synthetical sleep call
>
> --- a/src/backend/replication/syncrep.c
> +++ b/src/backend/replication/syncrep.c
> @@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void)
>
>         if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
>         {
> +               sleep(2);
>                 LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
>
>                 /*

Nice investigation to find out that this specific path faces a race
here.  Tests on Windows are usually slower and good at catching these.
It may be possible that the buildfarm has reported that, actually.

> So my question is
> can we accept such behavior as bug of postgres codebase, or, on other way,
> it is bug of unstable test?
> P.S.
> For me it is looking strange, that postgres start processing workload before
> full config initializing.
> Attached possible patch for codebase.

Yeah, I agree that it is a bug, and your approach to be more careful
in SyncRepWaitForLSN() so as we wait until the configuration is loaded
and set from the checkpointer sounds like the sensible thing to do,
because the backends have no idea yet what they should do as the
configuration is not loaded.

Tracking if the configuration has been properly loaded in
WalSndCtlData makes sense here, but I am confused by the patch: you
have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets
it.  It may be simpler to use a separate boolean flag for this
purpose.  WalSndCtlData is a private structure holding the state of
the WAL senders internally, so ABI does not stress me much here
(flexible array at the end of the structure, as well..).

Also mentioned by Kirill downthread: let's add a regression test with
an injection point that does a wait.  This race condition is worth
having a test for.  You could define the point of where you have added
your hardcoded sleep(), for example.
--
Michael

Attachment

signature.asc

Re: sync_standbys_defined read/write race on startup

From

"Maksim.Melnikov"

Date:

28 March, 10:57:58

On 27.03.2025 01:13, Michael Paquier wrote:

Tracking if the configuration has been properly loaded in
WalSndCtlData makes sense here, but I am confused by the patch: you
have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets
it.  It may be simpler to use a separate boolean flag for this
purpose.  WalSndCtlData is a private structure holding the state of
the WAL senders internally, so ABI does not stress me much here
(flexible array at the end of the structure, as well..).

Thanks for your responses.
Yes, I agree that it looks more complicated then addition new bool field,
but there is place in SyncRepWaitForLSN method, where we check
WalSndCtlData->sync_standbys_defined out of acquire/release block.

void
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
{
    int            mode;
...............................................
    if (!SyncRepRequested() ||
        !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
        return;
..........................

If we just add new bool field, we can't atomically check sync_standbys_defined flag
and new bool field, only in lock, so we lose fast exit optimization. But I agree
that in patch I lost SYNCSTANDBYSDEFINED setting, thanks for attentiveness,
so I've attached new patch with fix,also I've done bits setting more evident.

Anyway, if you still thinks that my arguments not enough, please notify, I will do patch with separate flag.

On 27.03.2025 01:13, Michael Paquier wrote:

Also mentioned by Kirill downthread: let's add a regression test with
an injection point that does a wait.  This race condition is worth
having a test for.  You could define the point of where you have added
your hardcoded sleep(), for example.

Sorry, but for me it seems that we can't use injection points toolset here, because we are talking
about startup process, and we can't call injection_points_attach on client backend before checkpointer.

Anyway, if you know how to do it, please share details, I will do.

Best regards, Maksim Melnikov

Attachment

0002-Fix-sync_standbys_defined-race-on-startup.patch

Re: sync_standbys_defined read/write race on startup

From

"Maksim.Melnikov"

Date:

07 April, 23:39:23

Hi everyone, thanks for your comments.
I've just wanted to share little cosmetic
fixes for previous patch.

Best Regards,
Maksim Melnikov

Attachment

0003-Fix-sync_standbys_defined-race-on-startup.patch

Re: sync_standbys_defined read/write race on startup

From

Michael Paquier

Date:

09 April, 10:45:57

On Mon, Apr 07, 2025 at 11:39:23PM +0300, Maksim.Melnikov wrote:
> Hi everyone, thanks for your comments.
> I've just wanted to share little cosmetic
> fixes for previous patch.

Note: I would suggest to use a version number for the patches, rather
than a 0003 which usually implies that this is third patch in a
series.  So I would use a command like that when generating a v4 of
this single patch:
git format-patch -1 -v4

`git am` is OK with what you have sent, but it's just a bit confusing.
Anyway, back to the patch..

I am still playing with the patch and will most probably send an
update tomorrow, but the injection point technique should be rather
straight-forward.  You have done most of the work by finding a
problematic case in 009_twophase.pl, so I would suggest the following:
- Instead of the sleep(), add a macro INJECTION_POINT() with a new
name.
- Then implement a TAP test that reproduces the failure by adding a
"wait" point, combined with a "wakeup", probably with a BackgroundPsql
object to keep the problematic SQL sequence alive while it waits.

I would suggest to look at what we do with create-restart-point in the
recovery test 041_checkpoint_at_promote.pl as a starting point.  For
background sessions, also look at 007_catcache_inval.pl in test_misc.
The new test may be better in 009_twophase.pl, conditional depending
on the existence of the extension "injection_points".  You can check
how it is done in the other TAP tests; these rely on check_extension()
and $ENV{enable_injection_points}.
--
Michael

Attachment

signature.asc

Re: sync_standbys_defined read/write race on startup

From

"Maksim.Melnikov"

Date:

09 April, 18:21:50

Hi Michael, thanks for your answer.

Test 041_checkpoint_at_promote.pl is really good example
of using injection points, but anyway, I still don't see
how injection points can help us here. In failed test case
we started postgres, immediately open psql connection and commit
prepared transaction. Postgres, on start, before accepting
connections, fork checkpointer and checkpointer initialize
shared parameter sync_standbys_defined in CheckpointerMain,
before process loop. So, in most attempts, we can't call
injection_points_attach in psql before
macro INJECTION_POINT() code will be executed in
checkpointer(I mean INJECTION_POINT() code instead sleep()),
bacause checkpointer process is forking before database
system is ready to accept connections.

The manual execution of checkpoint can't help us too.

P.S.
sorry for style errors, I am new in postgres, so can miss
some rules. Anyway, to avoid any community rules mistakes,
attached patch with correct naming.

Thanks.

Best regards,
Maksim Melnikov

Attachment

v4-0001-Fix-sync_standbys_defined-race-on-startup.patch

Re: sync_standbys_defined read/write race on startup

From

Michael Paquier

Date:

10 April, 05:25:38

On Wed, Apr 09, 2025 at 06:21:50PM +0300, Maksim.Melnikov wrote:
> Test 041_checkpoint_at_promote.pl is really good example
> of using injection points, but anyway, I still don't see
> how injection points can help us here. In failed test case
> we started postgres, immediately open psql connection and commit
> prepared transaction. Postgres, on start, before accepting
> connections, fork checkpointer and checkpointer initialize
> shared parameter sync_standbys_defined in CheckpointerMain,
> before process loop. So, in most attempts, we can't call
> injection_points_attach in psql before
> macro INJECTION_POINT() code will be executed in
> checkpointer(I mean INJECTION_POINT() code instead sleep()),
> bacause checkpointer process is forking before database
> system is ready to accept connections.

Confirmed.  One thing where it would be possible to make things work
is to introduce some persistency of the injection points, say these
are flushed at shutdown.  We could do that without touching at the
backend code and only in the module injection_points, but perhaps this
thread is not enough to justify this extra complexity.

> The manual execution of checkpoint can't help us too.
>
> P.S.
> sorry for style errors, I am new in postgres, so can miss
> some rules. Anyway, to avoid any community rules mistakes,
> attached patch with correct naming.

No worries, we are all here to learn, as am I.

+    sync_standbys_defined = WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED
+        ? (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED) : SyncStandbysDefined();

So this is the core point of the patch: if the sync standby data has
not been initialized, we try the only thing we can do with a check on
the GUC value.  Then:
- If the GUC has no value, then we know for sure that there is no
point to wait, so the patch skips the wait.
- If the GUC has a value, it may be possible that we have to wait but
we are not sure yet without letting the checkpointer process the GUC
and update the information in shared memory, so we decide that it's
better to wait and let the checkpointer process the setup until it
wakes up the queue, relying on the fact that there should be no
processes that wait as long as the GUC is not set.

This is mentioned in your commit message but there is close to zero
explanation about the reason of this choice, and even worse this fact
is hidden to the reader of the code because of the overly-complex
setup around sync_standbys_defined and the fact that there is no
comment to document this fact.

+/*
+ * WalSndCtlData->sync_standbys_defined bit flags. 7th bit check standbys_defined or not,
+ * 8th bit check WalSndCtlData->sync_standbys_defined field has been initialized.
+ */
+#define SYNCSTANDBYSINITIALIZED    (1 << 0)
+#define SYNCSTANDBYSDEFINED        (1 << 1)
+
+#define SET_SYNCSTANDBYS_INITIALIZED(sync_standbys)    \
+    (sync_standbys | SYNCSTANDBYSINITIALIZED)
+#define SET_SYNCSTANDBYS_DEFINED(sync_standbys)    \
+    (sync_standbys | SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)
+#define RESET_SYNCSTANDBYS_DEFINED(sync_standbys)    \
+    (SET_SYNCSTANDBYS_INITIALIZED(sync_standbys) & (~SYNCSTANDBYSDEFINED))
+#define IS_SYNCSTANDBYS_INITIALIZED_ONLY(sync_standbys)    \
+    ((sync_standbys & (SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)) == SYNCSTANDBYSINITIALIZED)
+

The use of the flags is a good idea to protect the atomicity of the
lookup in the fast path, but hiding that in a set of macros, with some
of them being used in only one location reduces the readability of the
code in walsender.c and syncrep.c because we would need to do a
permanent mapping with what walsender_private.h uses.  The reset one
depends on the set one, which is also confusing.

A couple of comments in walsender.c and syncrep.c also need a refresh
around the manipulations of sync_standbys_defined.

Also, the name "sync_standbys_defined" is not adapted anymore, as it
stores a status of the data in shared memory associated to the sync
standbys, so I've switched that to sync_standbys_status, with bits8 as
type so as we have the same size as a boolean.

I have taken a stab at all that, finishing with the attached.  The
patch has a pg_usleep() that I have used to stress the CI.  That
should be removed in the final result, obviously.

Side note: I am pretty sure that this issue has been the origin of
some spurious failures in the buildfarm.  I cannot point my finger to
a particular one now, but it would not go unnoticed especially on the
slow animals..

What do you think?
--
Michael

Attachment

Re: sync_standbys_defined read/write race on startup

From

"Maksim.Melnikov"

Date:

10 April, 10:12:34

On 10.04.2025 05:25, Michael Paquier wrote:
> Confirmed.  One thing where it would be possible to make things work
> is to introduce some persistency of the injection points, say these
> are flushed at shutdown.  We could do that without touching at the
> backend code and only in the module injection_points, but perhaps this
> thread is not enough to justify this extra complexity.

Agree, it should be discussed in another thread.


On 10.04.2025 05:25, Michael Paquier wrote:
> What do you think?

Yes, patch for me looks ok, only one suggestion, I am not sure

but I am afraid we are loosing this.


         if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT)
         {
                 if ((WalSndCtl->sync_standbys_status & 
SYNC_STANDBY_DEFINED) == 0 ||
                         lsn <= WalSndCtl->lsn[mode])
                 {
                         LWLockRelease(SyncRepLock);
                         return;
                 }
         }
         else if (!SyncStandbysDefined())
         {
                 /*
                  * If we are here, the sync standby data has not been 
initialized yet,
                  * so fall back to the best thing we can do: a check on
                  * SyncStandbysDefined() to see if the GUC is set or not.
                  *
                  * If the GUC has a value, we wait until the 
checkpointer updates the
                  * status data because we cannot be sure yet if we 
should wait or not.
                  * If the GUC has *no* value, we are sure that there is 
no point to
                  * wait; this matters for example when initializing a 
cluster, where
                  * we should never wait, and no sync standbys is the 
default behavior.
                  */
                 LWLockRelease(SyncRepLock);
                 return;
         }
+       else if (lsn <= WalSndCtl->lsn[mode])
+       {
+               LWLockRelease(SyncRepLock);
+               return;
+       }

What do you think?


Best regards,

Maksim Melnikov

Re: sync_standbys_defined read/write race on startup

From

Michael Paquier

Date:

10 April, 12:15:37

On Thu, Apr 10, 2025 at 10:12:34AM +0300, Maksim.Melnikov wrote:
> but I am afraid we are loosing this.
>
>    if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT)
>    {
>       if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) == 0 ||
>          lsn <= WalSndCtl->lsn[mode])
>       {
>          LWLockRelease(SyncRepLock);
>          return;
>       }
>    }
> [...]
> +  else if (lsn <= WalSndCtl->lsn[mode])
> +  {
> +     LWLockRelease(SyncRepLock);
> +     return;
> +   }
>
> What do you think?

Hmm, yeah.  Instead of last, it would be better to put it in second
place perhaps, for clarity?  That would be the same at the end, but we
would be slightly more consistent with the past logic regarding the
ordering.  Does that look OK to you?
--
Michael

Attachment

Re: sync_standbys_defined read/write race on startup

From

"Maksim.Melnikov"

Date:

10 April, 12:55:54

On 10.04.2025 12:15, Michael Paquier wrote:

> Hmm, yeah.  Instead of last, it would be better to put it in second
> place perhaps, for clarity?  That would be the same at the end, but we
> would be slightly more consistent with the past logic regarding the
> ordering.  Does that look OK to you?

Yes, from my point of view it looks fine.


Best regards,
Maksim Melnikov

Re: sync_standbys_defined read/write race on startup

From

Michael Paquier

Date:

11 April, 04:10:46

On Thu, Apr 10, 2025 at 12:55:54PM +0300, Maksim.Melnikov wrote:
> On 10.04.2025 12:15, Michael Paquier wrote:
>> Hmm, yeah.  Instead of last, it would be better to put it in second
>> place perhaps, for clarity?  That would be the same at the end, but we
>> would be slightly more consistent with the past logic regarding the
>> ordering.  Does that look OK to you?
>
> Yes, from my point of view it looks fine.

Thanks for the double-check.  I've played a couple more hours with the
startup case, like playing with s_s_names set but uninitialized in
shmem while stucking backends, and that seems OK, so applied down to
v13.  Let's see how it goes..
--
Michael

Attachment

signature.asc