Thread: sync_standbys_defined read/write race on startup
Hi,
On my windows machine I've found unstable test, it fails very rarely, but anyway, sometimes it fails.
src/test/recovery/t/009_twophase.pl
# Failed test 'Committed prepared transaction is visible to new snapshot in standby'
# at t/009_twophase.pl line 360.
# got: '0'
# expected: '2'
# Looks like you failed 1 test of 30.
t/009_twophase.pl ....................
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/30 subtests
cat src/test/recovery/t/009_twophase.pl
....
$cur_primary->psql(
'postgres', "
SET synchronous_commit='remote_apply'; -- To ensure the standby is caught up
CREATE TABLE t_009_tbl_standby_mvcc (id int, msg text);
BEGIN;
INSERT INTO t_009_tbl_standby_mvcc VALUES (1, 'issued to ${cur_primary_name}');
SAVEPOINT s1;
INSERT INTO t_009_tbl_standby_mvcc VALUES (2, 'issued to ${cur_primary_name}');
PREPARE TRANSACTION 'xact_009_standby_mvcc';
");
$cur_primary->stop;
$cur_standby->restart;
# Acquire a snapshot in standby, before we commit the prepared transaction
my $standby_session =
$cur_standby->background_psql('postgres', on_error_die => 1);
$standby_session->query_safe("BEGIN ISOLATION LEVEL REPEATABLE READ");
$psql_out =
$standby_session->query_safe("SELECT count(*) FROM t_009_tbl_standby_mvcc");
is($psql_out, '0',
"Prepared transaction not visible in standby before commit");
# Commit the transaction in primary
$cur_primary->start;
$cur_primary->psql(
'postgres', "
SET synchronous_commit='remote_apply'; -- To ensure the standby is caught up
COMMIT PREPARED 'xact_009_standby_mvcc';
");
# Still not visible to the old snapshot
$psql_out =
$standby_session->query_safe("SELECT count(*) FROM t_009_tbl_standby_mvcc");
is($psql_out, '0',
"Committed prepared transaction not visible to old snapshot in standby");
# Is visible to a new snapshot
$standby_session->query_safe("COMMIT");
$psql_out =
$standby_session->query_safe("SELECT count(*) FROM t_009_tbl_standby_mvcc");
is($psql_out, '2',
"Committed prepared transaction is visible to new snapshot in standby");
$standby_session->quit;
.....
Test check visibility of prepared transactions in standby after a restart while primary is down.
Failed assert check that changes, commited on transaction on primary node, were synchronously replicated to standby node.
In test flow we have primary server stop/start events.
Assert failure is hard to reproduce, but I found how to do it synthetically, can do it by injecting in code synthetical sleep call
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void)
if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
{
+ sleep(2);
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
/*
The root cause is WalSndCtl->sync_standbys_defined, it is located in shared memory and can be updated by checkpointer process only,
in backends it can be read in case of sync replication. So we have race on postgres startup between backend, processing commit request, and checkpointer,
setting WalSndCtl->sync_standbys_defined.
Synchronous replication occurred in SyncRepWaitForLSN(syncrep.c)
void
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
{
.....
/*
* Fast exit if user has not requested sync replication, or there are no
* sync replication standby names defined.
*
* Since this routine gets called every commit time, it's important to
* exit quickly if sync replication is not requested. So we check
* WalSndCtl->sync_standbys_defined flag without the lock and exit
* immediately if it's false. If it's true, we need to check it again
* later while holding the lock, to check the flag and operate the sync
* rep queue atomically. This is necessary to avoid the race condition
* described in SyncRepUpdateSyncStandbysDefined(). On the other hand, if
* it's false, the lock is not necessary because we don't touch the queue.
*/
if (!SyncRepRequested() ||
!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
return;
.....
if (!WalSndCtl->sync_standbys_defined ||
lsn <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
return;
}
...........
Checkpointer set WalSndCtl->sync_standbys_defined parameter in src/backend/replication/syncrep.c,
SyncRepUpdateSyncStandbysDefined method
void
SyncRepUpdateSyncStandbysDefined(void)
{
bool sync_standbys_defined = SyncStandbysDefined();
if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
{
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
..................
WalSndCtl->sync_standbys_defined = sync_standbys_defined;
LWLockRelease(SyncRepLock);
}
}
So my question is
can we accept such behavior as bug of postgres codebase, or, on other way, it is bug of unstable test?
P.S.
For me it is looking strange, that postgres start processing workload before full config initializing.
Attached possible patch for codebase.
Best regards
Melnikov Maksim
Attachment
On Wed, 26 Mar 2025 at 21:18, Maksim.Melnikov <m.melnikov@postgrespro.ru> wrote: > > So my question is > can we accept such behavior as bug of postgres codebase, or, on other way, it is bug of unstable test? > P.S. > For me it is looking strange, that postgres start processing workload before full config initializing. > Attached possible patch for codebase. > > Best regards > Melnikov Maksim We do not tolerate TAP test flaps. Your repro is OK, can we have deterministic test here using injection point technique? -- Best regards, Kirill Reshke
On Wed, Mar 26, 2025 at 05:17:17PM +0300, Maksim.Melnikov wrote: > Test check visibility of prepared transactions in standby after a restart > while primary is down. > Failed assert check that changes, commited on transaction on primary node, > were synchronously replicated to standby node. > In test flow we have primary server stop/start events. > Assert failure is hard to reproduce, but I found how to do it synthetically, > can do it by injecting in code synthetical sleep call > > --- a/src/backend/replication/syncrep.c > +++ b/src/backend/replication/syncrep.c > @@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void) > > if (sync_standbys_defined != WalSndCtl->sync_standbys_defined) > { > + sleep(2); > LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); > > /* Nice investigation to find out that this specific path faces a race here. Tests on Windows are usually slower and good at catching these. It may be possible that the buildfarm has reported that, actually. > So my question is > can we accept such behavior as bug of postgres codebase, or, on other way, > it is bug of unstable test? > P.S. > For me it is looking strange, that postgres start processing workload before > full config initializing. > Attached possible patch for codebase. Yeah, I agree that it is a bug, and your approach to be more careful in SyncRepWaitForLSN() so as we wait until the configuration is loaded and set from the checkpointer sounds like the sensible thing to do, because the backends have no idea yet what they should do as the configuration is not loaded. Tracking if the configuration has been properly loaded in WalSndCtlData makes sense here, but I am confused by the patch: you have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets it. It may be simpler to use a separate boolean flag for this purpose. WalSndCtlData is a private structure holding the state of the WAL senders internally, so ABI does not stress me much here (flexible array at the end of the structure, as well..). Also mentioned by Kirill downthread: let's add a regression test with an injection point that does a wait. This race condition is worth having a test for. You could define the point of where you have added your hardcoded sleep(), for example. -- Michael
Attachment
Tracking if the configuration has been properly loaded in WalSndCtlData makes sense here, but I am confused by the patch: you have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets it. It may be simpler to use a separate boolean flag for this purpose. WalSndCtlData is a private structure holding the state of the WAL senders internally, so ABI does not stress me much here (flexible array at the end of the structure, as well..).
Thanks for your responses.
Yes, I agree that it looks more complicated then addition new bool field,
but there is place in SyncRepWaitForLSN method, where we check
WalSndCtlData->sync_standbys_defined out of acquire/release block.
void
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
{
int mode;
...............................................
if (!SyncRepRequested() ||
!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
return;
..........................
If we just add new bool field, we can't atomically check sync_standbys_defined flag
and new bool field, only in lock, so we lose fast exit optimization. But I agree
that in patch I lost SYNCSTANDBYSDEFINED setting, thanks for attentiveness,
so I've attached new patch with fix,also I've done bits setting more evident.
Anyway, if you still thinks that my arguments not enough, please notify, I will do patch with separate flag.
Sorry, but for me it seems that we can't use injection points toolset here, because we are talkingAlso mentioned by Kirill downthread: let's add a regression test with an injection point that does a wait. This race condition is worth having a test for. You could define the point of where you have added your hardcoded sleep(), for example.
about startup process, and we can't call injection_points_attach on client backend before checkpointer.
Anyway, if you know how to do it, please share details, I will do.
Best regards, Maksim Melnikov
Attachment
Hi everyone, thanks for your comments. I've just wanted to share little cosmetic fixes for previous patch. Best Regards, Maksim Melnikov
Attachment
On Mon, Apr 07, 2025 at 11:39:23PM +0300, Maksim.Melnikov wrote: > Hi everyone, thanks for your comments. > I've just wanted to share little cosmetic > fixes for previous patch. Note: I would suggest to use a version number for the patches, rather than a 0003 which usually implies that this is third patch in a series. So I would use a command like that when generating a v4 of this single patch: git format-patch -1 -v4 `git am` is OK with what you have sent, but it's just a bit confusing. Anyway, back to the patch.. I am still playing with the patch and will most probably send an update tomorrow, but the injection point technique should be rather straight-forward. You have done most of the work by finding a problematic case in 009_twophase.pl, so I would suggest the following: - Instead of the sleep(), add a macro INJECTION_POINT() with a new name. - Then implement a TAP test that reproduces the failure by adding a "wait" point, combined with a "wakeup", probably with a BackgroundPsql object to keep the problematic SQL sequence alive while it waits. I would suggest to look at what we do with create-restart-point in the recovery test 041_checkpoint_at_promote.pl as a starting point. For background sessions, also look at 007_catcache_inval.pl in test_misc. The new test may be better in 009_twophase.pl, conditional depending on the existence of the extension "injection_points". You can check how it is done in the other TAP tests; these rely on check_extension() and $ENV{enable_injection_points}. -- Michael
Attachment
Hi Michael, thanks for your answer. Test 041_checkpoint_at_promote.pl is really good example of using injection points, but anyway, I still don't see how injection points can help us here. In failed test case we started postgres, immediately open psql connection and commit prepared transaction. Postgres, on start, before accepting connections, fork checkpointer and checkpointer initialize shared parameter sync_standbys_defined in CheckpointerMain, before process loop. So, in most attempts, we can't call injection_points_attach in psql before macro INJECTION_POINT() code will be executed in checkpointer(I mean INJECTION_POINT() code instead sleep()), bacause checkpointer process is forking before database system is ready to accept connections. The manual execution of checkpoint can't help us too. P.S. sorry for style errors, I am new in postgres, so can miss some rules. Anyway, to avoid any community rules mistakes, attached patch with correct naming. Thanks. Best regards, Maksim Melnikov
Attachment
On Wed, Apr 09, 2025 at 06:21:50PM +0300, Maksim.Melnikov wrote: > Test 041_checkpoint_at_promote.pl is really good example > of using injection points, but anyway, I still don't see > how injection points can help us here. In failed test case > we started postgres, immediately open psql connection and commit > prepared transaction. Postgres, on start, before accepting > connections, fork checkpointer and checkpointer initialize > shared parameter sync_standbys_defined in CheckpointerMain, > before process loop. So, in most attempts, we can't call > injection_points_attach in psql before > macro INJECTION_POINT() code will be executed in > checkpointer(I mean INJECTION_POINT() code instead sleep()), > bacause checkpointer process is forking before database > system is ready to accept connections. Confirmed. One thing where it would be possible to make things work is to introduce some persistency of the injection points, say these are flushed at shutdown. We could do that without touching at the backend code and only in the module injection_points, but perhaps this thread is not enough to justify this extra complexity. > The manual execution of checkpoint can't help us too. > > P.S. > sorry for style errors, I am new in postgres, so can miss > some rules. Anyway, to avoid any community rules mistakes, > attached patch with correct naming. No worries, we are all here to learn, as am I. + sync_standbys_defined = WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED + ? (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED) : SyncStandbysDefined(); So this is the core point of the patch: if the sync standby data has not been initialized, we try the only thing we can do with a check on the GUC value. Then: - If the GUC has no value, then we know for sure that there is no point to wait, so the patch skips the wait. - If the GUC has a value, it may be possible that we have to wait but we are not sure yet without letting the checkpointer process the GUC and update the information in shared memory, so we decide that it's better to wait and let the checkpointer process the setup until it wakes up the queue, relying on the fact that there should be no processes that wait as long as the GUC is not set. This is mentioned in your commit message but there is close to zero explanation about the reason of this choice, and even worse this fact is hidden to the reader of the code because of the overly-complex setup around sync_standbys_defined and the fact that there is no comment to document this fact. +/* + * WalSndCtlData->sync_standbys_defined bit flags. 7th bit check standbys_defined or not, + * 8th bit check WalSndCtlData->sync_standbys_defined field has been initialized. + */ +#define SYNCSTANDBYSINITIALIZED (1 << 0) +#define SYNCSTANDBYSDEFINED (1 << 1) + +#define SET_SYNCSTANDBYS_INITIALIZED(sync_standbys) \ + (sync_standbys | SYNCSTANDBYSINITIALIZED) +#define SET_SYNCSTANDBYS_DEFINED(sync_standbys) \ + (sync_standbys | SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED) +#define RESET_SYNCSTANDBYS_DEFINED(sync_standbys) \ + (SET_SYNCSTANDBYS_INITIALIZED(sync_standbys) & (~SYNCSTANDBYSDEFINED)) +#define IS_SYNCSTANDBYS_INITIALIZED_ONLY(sync_standbys) \ + ((sync_standbys & (SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)) == SYNCSTANDBYSINITIALIZED) + The use of the flags is a good idea to protect the atomicity of the lookup in the fast path, but hiding that in a set of macros, with some of them being used in only one location reduces the readability of the code in walsender.c and syncrep.c because we would need to do a permanent mapping with what walsender_private.h uses. The reset one depends on the set one, which is also confusing. A couple of comments in walsender.c and syncrep.c also need a refresh around the manipulations of sync_standbys_defined. Also, the name "sync_standbys_defined" is not adapted anymore, as it stores a status of the data in shared memory associated to the sync standbys, so I've switched that to sync_standbys_status, with bits8 as type so as we have the same size as a boolean. I have taken a stab at all that, finishing with the attached. The patch has a pg_usleep() that I have used to stress the CI. That should be removed in the final result, obviously. Side note: I am pretty sure that this issue has been the origin of some spurious failures in the buildfarm. I cannot point my finger to a particular one now, but it would not go unnoticed especially on the slow animals.. What do you think? -- Michael
Attachment
On 10.04.2025 05:25, Michael Paquier wrote: > Confirmed. One thing where it would be possible to make things work > is to introduce some persistency of the injection points, say these > are flushed at shutdown. We could do that without touching at the > backend code and only in the module injection_points, but perhaps this > thread is not enough to justify this extra complexity. Agree, it should be discussed in another thread. On 10.04.2025 05:25, Michael Paquier wrote: > What do you think? Yes, patch for me looks ok, only one suggestion, I am not sure but I am afraid we are loosing this. if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT) { if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) == 0 || lsn <= WalSndCtl->lsn[mode]) { LWLockRelease(SyncRepLock); return; } } else if (!SyncStandbysDefined()) { /* * If we are here, the sync standby data has not been initialized yet, * so fall back to the best thing we can do: a check on * SyncStandbysDefined() to see if the GUC is set or not. * * If the GUC has a value, we wait until the checkpointer updates the * status data because we cannot be sure yet if we should wait or not. * If the GUC has *no* value, we are sure that there is no point to * wait; this matters for example when initializing a cluster, where * we should never wait, and no sync standbys is the default behavior. */ LWLockRelease(SyncRepLock); return; } + else if (lsn <= WalSndCtl->lsn[mode]) + { + LWLockRelease(SyncRepLock); + return; + } What do you think? Best regards, Maksim Melnikov
On Thu, Apr 10, 2025 at 10:12:34AM +0300, Maksim.Melnikov wrote: > but I am afraid we are loosing this. > > if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT) > { > if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) == 0 || > lsn <= WalSndCtl->lsn[mode]) > { > LWLockRelease(SyncRepLock); > return; > } > } > [...] > + else if (lsn <= WalSndCtl->lsn[mode]) > + { > + LWLockRelease(SyncRepLock); > + return; > + } > > What do you think? Hmm, yeah. Instead of last, it would be better to put it in second place perhaps, for clarity? That would be the same at the end, but we would be slightly more consistent with the past logic regarding the ordering. Does that look OK to you? -- Michael
Attachment
On 10.04.2025 12:15, Michael Paquier wrote: > Hmm, yeah. Instead of last, it would be better to put it in second > place perhaps, for clarity? That would be the same at the end, but we > would be slightly more consistent with the past logic regarding the > ordering. Does that look OK to you? Yes, from my point of view it looks fine. Best regards, Maksim Melnikov
On Thu, Apr 10, 2025 at 12:55:54PM +0300, Maksim.Melnikov wrote: > On 10.04.2025 12:15, Michael Paquier wrote: >> Hmm, yeah. Instead of last, it would be better to put it in second >> place perhaps, for clarity? That would be the same at the end, but we >> would be slightly more consistent with the past logic regarding the >> ordering. Does that look OK to you? > > Yes, from my point of view it looks fine. Thanks for the double-check. I've played a couple more hours with the startup case, like playing with s_s_names set but uninitialized in shmem while stucking backends, and that seems OK, so applied down to v13. Let's see how it goes.. -- Michael