Thread: [HACKERS] Apparent walsender bug triggered by logical replication
I've been poking into the src/test/subscription TAP tests, thinking that
they seem a lot slower than they ought to be.  The first thing I came
across was this bit in WaitForReplicationWorkerAttach():

		/*
		 * We need timeout because we generally don't get notified via latch
		 * about the worker attach.
		 */
		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
					   1000L, WAIT_EVENT_BGWORKER_STARTUP);

Tracing showed that the worker is generally done attaching within two or
three milliseconds of the request, making the 1000ms delay in this loop
a rather serious killer of startup performance.  I think the right way
to fix that is to arrange to have a signal sent back from the worker;
but just to confirm that this *is* a bottleneck, I tried cutting the
timeout way back.  Setting it to 10ms indeed made the 001 and 002
subscription tests a good deal faster --- most of the time.  Every
second or third run or so, either the 001 or 002 test would take about
20sec longer than it ought to.  When I set it to 5ms, the 001 test
reliably took ten times longer than normal, and the 002 test failed
completely.

I've spent quite a bit of time tracking down why, and what seems to be
happening is that elimination of the excess delay in
WaitForReplicationWorkerAttach allows this if-test in
LogicalRepSyncTableStart:

		/*----------
		 * There are now two possible states here:
		 * a) Sync is behind the apply.  If that's the case we need to
		 *    catch up with it by consuming the logical replication
		 *    stream up to the relstate_lsn.  For that, we exit this
		 *    function and continue in ApplyWorkerMain().
		 * b) Sync is caught up with the apply.  So it can just set
		 *    the state to SYNCDONE and finish.
		 *----------
		 */
		if (*origin_startpos >= MyLogicalRepWorker->relstate_lsn)

to sometimes fail, whereas it basically never does with the delay in
place, because the launcher never gets there fast enough to advance
relstate_lsn past what the synchronization worker had set, before the
worker gets here and decides it can exit.
That means we do what the comment calls (a), that is, return to
ApplyWorkerMain, start up streaming replication, and enter
LogicalRepApplyLoop.  But, almost immediately, we reach the
synchronization point and decide to shut down again.  What happens then
is that the partner walsender gets hung up for tens of seconds, and
until it sends back the "COPY 0" CommandComplete message, the sync
worker can't finish and exit.

What seems to be the reason for the hang is that WalSndWaitForWal() is
being called to ask for WAL that doesn't exist yet, and unlike the
other loops in walsender.c, it contains no provision for exiting early
when the client indicates it's lost interest.  So we just sit there
until something happens on the master server.  Although this situation
might not last very long on production servers, it still seems like a
serious bug.

I was able to make the hang go away by means of the attached patch that
allows WalSndWaitForWal to exit early if the client has shut down the
COPY.  However, since that function is miserably underdocumented (like
most of this code :-(), I have little idea if this is correct or safe.
I also wonder why WalSndWaitForWal is being called for WAL that
seemingly doesn't exist yet, and whether that doesn't indicate another
bug somewhere in this stack.

I'm also kind of wondering why the "behind the apply" path out of
LogicalRepSyncTableStart exists at all; as far as I can tell we'd be
much better off if we just let the sync worker exit always as soon as
it's done the initial sync, letting any extra catchup happen later.
The main thing the current behavior seems to be accomplishing is to
monopolize one of the scarce max_sync_workers_per_subscription slots
for the benefit of a single table, for longer than necessary.  Plus it
adds additional complicated interprocess signaling.
			regards, tom lane

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f845180..2611d62 100644
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*************** WalSndWaitForWal(XLogRecPtr loc)
*** 1374,1379 ****
--- 1374,1387 ----
  		if (pq_flush_if_writable() != 0)
  			WalSndShutdown();
  
+ 		/*
+ 		 * If we have received CopyDone from the client, sent CopyDone
+ 		 * ourselves, and the output buffer is empty, it's time to exit
+ 		 * streaming, so fail the current WAL fetch request.
+ 		 */
+ 		if (!pq_is_send_pending() && streamingDoneSending && streamingDoneReceiving)
+ 			break;
+ 
  		now = GetCurrentTimestamp();
  
  		/* die if timeout was reached */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,

On 2017-06-29 20:07:14 -0400, Tom Lane wrote:
> I was able to make the hang go away by means of the attached patch that
> allows WalSndWaitForWal to exit early if the client has shut down the
> COPY.  However, since that function is miserably underdocumented (like
> most of this code :-(), I have little idea if this is correct or safe.

Seems reasonable to me.

> I also wonder why WalSndWaitForWal is being called for WAL that seemingly
> doesn't exist yet, and whether that doesn't indicate another bug somewhere
> in this stack.

That's pretty normal - we can only send back something once a transaction
is complete, and until that happens we'll just block waiting for more WAL.

Greetings,

Andres Freund
On 30/06/17 02:07, Tom Lane wrote:
> I'm also kind of wondering why the "behind the apply" path out of
> LogicalRepSyncTableStart exists at all; as far as I can tell we'd be much
> better off if we just let the sync worker exit always as soon as it's done
> the initial sync, letting any extra catchup happen later.  The main thing
> the current behavior seems to be accomplishing is to monopolize one of the
> scarce max_sync_workers_per_subscription slots for the benefit of a single
> table, for longer than necessary.  Plus it adds additional complicated
> interprocess signaling.

Hmm, I don't understand what you mean here.  The "letting any extra
catchup happen later" would never happen if the sync is behind apply, as
apply has already skipped relevant transactions.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
> On 30/06/17 02:07, Tom Lane wrote:
>> I'm also kind of wondering why the "behind the apply" path out of
>> LogicalRepSyncTableStart exists at all; as far as I can tell we'd be much
>> better off if we just let the sync worker exit always as soon as it's done
>> the initial sync, letting any extra catchup happen later.  The main thing
>> the current behavior seems to be accomplishing is to monopolize one of the
>> scarce max_sync_workers_per_subscription slots for the benefit of a single
>> table, for longer than necessary.  Plus it adds additional complicated
>> interprocess signaling.

> Hmm, I don't understand what you mean here.  The "letting any extra
> catchup happen later" would never happen if the sync is behind apply as
> apply has already skipped relevant transactions.

Once the sync worker has exited, we have to have some other way of
dealing with that.  I'm wondering why we can't let that other way take
over immediately.  The existing approach is inefficient, according to
the traces I've been poring over all day, and frankly I am very far from
convinced that it's bug-free either.

			regards, tom lane
On 30/06/17 04:46, Tom Lane wrote:
> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
>> On 30/06/17 02:07, Tom Lane wrote:
>>> I'm also kind of wondering why the "behind the apply" path out of
>>> LogicalRepSyncTableStart exists at all; as far as I can tell we'd be much
>>> better off if we just let the sync worker exit always as soon as it's done
>>> the initial sync, letting any extra catchup happen later.  The main thing
>>> the current behavior seems to be accomplishing is to monopolize one of the
>>> scarce max_sync_workers_per_subscription slots for the benefit of a single
>>> table, for longer than necessary.  Plus it adds additional complicated
>>> interprocess signaling.
> 
>> Hmm, I don't understand what you mean here.  The "letting any extra
>> catchup happen later" would never happen if the sync is behind apply as
>> apply has already skipped relevant transactions.
> 
> Once the sync worker has exited, we have to have some other way of dealing
> with that.  I'm wondering why we can't let that other way take over

We make apply wait for the sync worker to get to the expected position
if it was behind, and only then continue; we can't exactly do that if
the apply has already skipped some changes.

> immediately.  The existing approach is inefficient, according to the
> traces I've been poring over all day, and frankly I am very far from
> convinced that it's bug-free either.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
I wrote:
> I've been poking into the src/test/subscription TAP tests, thinking
> that they seem a lot slower than they ought to be.  The first thing
> I came across was this bit in WaitForReplicationWorkerAttach():

> 		/*
> 		 * We need timeout because we generally don't get notified via latch
> 		 * about the worker attach.
> 		 */
> 		rc = WaitLatch(MyLatch,
> 					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
> 					   1000L, WAIT_EVENT_BGWORKER_STARTUP);

> Tracing showed that the worker is generally done attaching within two
> or three milliseconds of the request, making the 1000ms delay in this
> loop a rather serious killer of startup performance.  I think the right
> way to fix that is to arrange to have a signal sent back from the worker;
> but just to confirm that this *is* a bottleneck, I tried cutting the
> timeout way back.

I found that logicalrep_worker_stop() has pretty much the identical
issue: it needs to wait for a worker to attach or detach, and it isn't
going to get a latch event for that, and it's using 1-second wait quanta
for events that can be expected to occur within milliseconds.

I still think that the "right" fix would involve adding a way to get a
latch signal for attach/detach, but after some investigation it seemed
like a less-than-trivial change.  Since we could do without risky
changes post-beta, I propose that we just reduce these wait parameters
to 10ms, as in the first attached patch.  It may not ever be worth
working harder than that, since these waits only occur when
starting/stopping/changing logical replication.

I found another source of one-second-ish delays, which is that when
activity on the master stops, LogicalRepApplyLoop generally goes to
sleep without having ack'd the last bit of activity it replayed, since
that won't have been flushed by the walwriter yet.  It will send a
feedback message if it times out --- but that takes a whole second.
Maybe longer, if unrelated wakeups prevent it from seeing WL_TIMEOUT
for a few loop iterations.
In reality, we can expect the last WAL to have been flushed by the
walwriter in, more or less, WalWriterDelay msec.  Therefore, the second
attached patch tweaks the LogicalRepApplyLoop loop to wait at most
WalWriterDelay if it's got unflushed transactions yet to ack to the
sender, and ensures that we will call send_feedback in the loop even if
no new data arrived.

These two changes cut about 15% off the runtime of the subscription
tests for me (compared to where things stand after 1f201a818).  They're
probably less interesting than that for production purposes, but I
think it's worth doing anyway.

			regards, tom lane

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 961110c..d165d51 100644
*** a/src/backend/replication/logical/launcher.c
--- b/src/backend/replication/logical/launcher.c
*************** WaitForReplicationWorkerAttach(LogicalRe
*** 201,211 ****
  
  		/*
  		 * We need timeout because we generally don't get notified via latch
! 		 * about the worker attach.
  		 */
  		rc = WaitLatch(MyLatch,
  					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
! 					   1000L, WAIT_EVENT_BGWORKER_STARTUP);
  
  		/* emergency bailout if postmaster has died */
  		if (rc & WL_POSTMASTER_DEATH)
--- 201,211 ----
  
  		/*
  		 * We need timeout because we generally don't get notified via latch
! 		 * about the worker attach.  But we don't expect to have to wait long.
  		 */
  		rc = WaitLatch(MyLatch,
  					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
! 					   10L, WAIT_EVENT_BGWORKER_STARTUP);
  
  		/* emergency bailout if postmaster has died */
  		if (rc & WL_POSTMASTER_DEATH)
*************** retry:
*** 408,415 ****
  }
  
  /*
!  * Stop the logical replication worker and wait until it detaches from the
!  * slot.
   */
  void
  logicalrep_worker_stop(Oid subid, Oid relid)
--- 408,415 ----
  }
  
  /*
!  * Stop the logical replication worker for subid/relid, if any, and wait until
!  * it detaches from the slot.
   */
  void
  logicalrep_worker_stop(Oid subid, Oid relid)
*************** logicalrep_worker_stop(Oid subid, Oid re
*** 435,442 ****
  	generation = worker->generation;
  
  	/*
! 	 * If we found worker but it does not have proc set it is starting up,
! 	 * wait for it to finish and then kill it.
  	 */
  	while (worker->in_use && !worker->proc)
  	{
--- 435,442 ----
  	generation = worker->generation;
  
  	/*
! 	 * If we found a worker but it does not have proc set then it is still
! 	 * starting up; wait for it to finish starting and then kill it.
  	 */
  	while (worker->in_use && !worker->proc)
  	{
*************** logicalrep_worker_stop(Oid subid, Oid re
*** 444,453 ****
  
  		LWLockRelease(LogicalRepWorkerLock);
  
! 		/* Wait for signal. */
  		rc = WaitLatch(MyLatch,
  					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
! 					   1000L, WAIT_EVENT_BGWORKER_STARTUP);
  
  		/* emergency bailout if postmaster has died */
  		if (rc & WL_POSTMASTER_DEATH)
--- 444,453 ----
  
  		LWLockRelease(LogicalRepWorkerLock);
  
! 		/* Wait a bit --- we don't expect to have to wait long. */
  		rc = WaitLatch(MyLatch,
  					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
! 					   10L, WAIT_EVENT_BGWORKER_STARTUP);
  
  		/* emergency bailout if postmaster has died */
  		if (rc & WL_POSTMASTER_DEATH)
*************** logicalrep_worker_stop(Oid subid, Oid re
*** 459,465 ****
  		CHECK_FOR_INTERRUPTS();
  	}
  
! 	/* Check worker status. */
  	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
  
  	/*
--- 459,465 ----
  		CHECK_FOR_INTERRUPTS();
  	}
  
! 	/* Recheck worker status. */
  	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
  
  	/*
*************** logicalrep_worker_stop(Oid subid, Oid re
*** 480,506 ****
  	/* Now terminate the worker ... */
  	kill(worker->proc->pid, SIGTERM);
- 	LWLockRelease(LogicalRepWorkerLock);
  
  	/* ... and wait for it to die. */
  	for (;;)
  	{
  		int			rc;
  
! 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
  		if (!worker->proc || worker->generation != generation)
- 		{
- 			LWLockRelease(LogicalRepWorkerLock);
  			break;
- 		}
- 		LWLockRelease(LogicalRepWorkerLock);
  
! 		CHECK_FOR_INTERRUPTS();
! 
! 		/* Wait for more work. */
  		rc = WaitLatch(MyLatch,
  					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
! 					   1000L, WAIT_EVENT_BGWORKER_SHUTDOWN);
  
  		/* emergency bailout if postmaster has died */
  		if (rc & WL_POSTMASTER_DEATH)
--- 480,501 ----
  	/* Now terminate the worker ... */
  	kill(worker->proc->pid, SIGTERM);
  
  	/* ... and wait for it to die. */
  	for (;;)
  	{
  		int			rc;
  
! 		/* is it gone? */
  		if (!worker->proc || worker->generation != generation)
  			break;
  
! 		LWLockRelease(LogicalRepWorkerLock);
! 
! 		/* Wait a bit --- we don't expect to have to wait long. */
  		rc = WaitLatch(MyLatch,
  					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
! 					   10L, WAIT_EVENT_BGWORKER_SHUTDOWN);
  
  		/* emergency bailout if postmaster has died */
  		if (rc & WL_POSTMASTER_DEATH)
*************** logicalrep_worker_stop(Oid subid, Oid re
*** 511,517 ****
--- 506,516 ----
  			ResetLatch(MyLatch);
  			CHECK_FOR_INTERRUPTS();
  		}
+ 
+ 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
  	}
+ 
+ 	LWLockRelease(LogicalRepWorkerLock);
  }
  
  /*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 898c497..0d48dfa 100644
*** a/src/backend/replication/logical/worker.c
--- b/src/backend/replication/logical/worker.c
***************
*** 52,57 ****
--- 52,58 ----
  #include "postmaster/bgworker.h"
  #include "postmaster/postmaster.h"
+ #include "postmaster/walwriter.h"
  
  #include "replication/decode.h"
  #include "replication/logical.h"
*************** LogicalRepApplyLoop(XLogRecPtr last_rece
*** 1027,1032 ****
--- 1028,1034 ----
  	bool		endofstream = false;
  	TimestampTz last_recv_timestamp = GetCurrentTimestamp();
  	bool		ping_sent = false;
+ 	long		wait_time;
  
  	CHECK_FOR_INTERRUPTS();
*************** LogicalRepApplyLoop(XLogRecPtr last_rece
*** 1114,1124 ****
  				len = walrcv_receive(wrconn, &buf, &fd);
  			}
- 
- 			/* confirm all writes at once */
- 			send_feedback(last_received, false, false);
  		}
  
  		if (!in_remote_transaction)
  		{
  			/*
--- 1116,1126 ----
  				len = walrcv_receive(wrconn, &buf, &fd);
  			}
  		}
  
+ 		/* confirm all writes so far */
+ 		send_feedback(last_received, false, false);
+ 
  		if (!in_remote_transaction)
  		{
  			/*
*************** LogicalRepApplyLoop(XLogRecPtr last_rece
*** 1147,1158 ****
  		}
  
  		/*
! 		 * Wait for more data or latch.
  		 */
  		rc = WaitLatchOrSocket(MyLatch,
  							   WL_SOCKET_READABLE | WL_LATCH_SET |
  							   WL_TIMEOUT | WL_POSTMASTER_DEATH,
! 							   fd, NAPTIME_PER_CYCLE,
  							   WAIT_EVENT_LOGICAL_APPLY_MAIN);
  
  		/* Emergency bailout if postmaster has died */
--- 1149,1169 ----
  		}
  
  		/*
! 		 * Wait for more data or latch.  If we have unflushed transactions,
! 		 * wake up after WalWriterDelay to see if they've been flushed yet (in
! 		 * which case we should send a feedback message).  Otherwise, there's
! 		 * no particular urgency about waking up unless we get data or a
! 		 * signal.
  		 */
+ 		if (!dlist_is_empty(&lsn_mapping))
+ 			wait_time = WalWriterDelay;
+ 		else
+ 			wait_time = NAPTIME_PER_CYCLE;
+ 
  		rc = WaitLatchOrSocket(MyLatch,
  							   WL_SOCKET_READABLE | WL_LATCH_SET |
  							   WL_TIMEOUT | WL_POSTMASTER_DEATH,
! 							   fd, wait_time,
  							   WAIT_EVENT_LOGICAL_APPLY_MAIN);
  
  		/* Emergency bailout if postmaster has died */