Thread: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From
Tomas Vondra
Date:
Hi,

while working on some logical replication stuff, I noticed that on PG16 I often end up with a completely idle publisher (no user activity) that nevertheless looks like this in top:

    ... %CPU COMMAND
    ... 17.9 postgres: walsender user test ::1(43064) START_REPLICATION
    ... 16.6 postgres: walsender user test ::1(43128) START_REPLICATION
    ... 15.0 postgres: walsender user test ::1(43202) START_REPLICATION
    ...  6.6 postgres: walsender user test ::1(43236) START_REPLICATION
    ...  5.3 postgres: walsender user test ::1(43086) START_REPLICATION
    ...  4.3 postgres: walsender user test ::1(43180) START_REPLICATION
    ...  3.7 postgres: walsender user test ::1(43052) START_REPLICATION
    ...  3.7 postgres: walsender user test ::1(43158) START_REPLICATION
    ...  3.0 postgres: walsender user test ::1(43108) START_REPLICATION
    ...  3.0 postgres: walsender user test ::1(43214) START_REPLICATION

That's an awful lot of CPU for a cluster doing essentially "nothing" (there's no WAL to decode/replicate, etc.). It usually disappears after a couple seconds, but sometimes it's a rather persistent state.

The profile from the walsender processes looks like this:

    --99.94%--XLogSendLogical
              |
              |--99.23%--XLogReadRecord
              |          XLogReadAhead
              |          XLogDecodeNextRecord
              |          ReadPageInternal
              |          logical_read_xlog_page
              |          |
              |          |--97.80%--WalSndWaitForWal
              |          |          |
              |          |          |--68.48%--WalSndWait

It seems to me the issue is in WalSndWait, which was reworked to use ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up waking each other in a busy loop, until the timing changes just enough to break the cycle. I've been unable to reproduce this on PG15, and bc971f4025c seems like the most significant change to WalSndWait, which is why I suspect it's related to the issue.

Reproducing this is simple: create a publisher with multiple subscribers (they could even go to the same subscriber instance) and empty publications. If you then trigger a "noop", it's likely to cause this high CPU usage:

---------------------------------------------------------------------
# init two clusters
pg_ctl -D data-publisher init
pg_ctl -D data-subscriber init

# change the parameters to allow 10 subscriptions
echo 'wal_level = logical' >> data-publisher/postgresql.conf
echo 'port = 5433' >> data-subscriber/postgresql.conf
echo 'max_worker_processes = 20' >> data-subscriber/postgresql.conf
echo 'max_logical_replication_workers = 20' >> data-subscriber/postgresql.conf

# setup empty publication
createdb test
psql test -c "create publication p";

# setup 10 subscriptions
for i in `seq 1 10`; do
  createdb -p 5433 test$i
  psql -p 5433 test$i -c "create subscription s$i CONNECTION 'host=localhost port=5432 dbname=test' publication p";
done

# emit logical messages, which are almost noop, 5s apart
for i in `seq 1 10`; do
  psql test -c "select pg_logical_emit_message(false, 'x', 'x')";
  sleep 5;
done;
---------------------------------------------------------------------

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
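[Editorial note: for readers without the tree at hand, the shape of the WalSndWait() rework being blamed here looks roughly like the sketch below. It is paraphrased from memory of PG16's walsender.c after bc971f4025c, so names (WalSndCtl->wal_flush_cv, wal_replay_cv, MyWalSnd->kind) and details should be treated as assumptions rather than the exact committed code. The key point, confirmed later in the thread, is that the walsender only prepares/cancels a CV sleep and never calls ConditionVariableSleep().]

    /*
     * Sketch of the PG16 WalSndWait() pattern (paraphrased, not verbatim).
     */
    #include "postgres.h"
    #include "libpq/libpq.h"                      /* FeBeWaitSet */
    #include "replication/walsender_private.h"    /* WalSndCtl, MyWalSnd */
    #include "storage/condition_variable.h"
    #include "storage/ipc.h"                      /* proc_exit */

    static void
    WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
    {
        WaitEvent   event;

        ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);

        /*
         * Join the CV's wait list so a broadcast sets our latch, but never
         * call ConditionVariableSleep(): the actual waiting happens in
         * WaitEventSetWait(), because the socket must be watched too.
         */
        if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL)
            ConditionVariablePrepareToSleep(&WalSndCtl->wal_flush_cv);
        else if (MyWalSnd->kind == REPLICATION_KIND_LOGICAL)
            ConditionVariablePrepareToSleep(&WalSndCtl->wal_replay_cv);

        if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
            (event.events & WL_POSTMASTER_DEATH))
            proc_exit(1);

        /*
         * Pre-fix, this call could decide "we were signaled" (a broadcast
         * had already unlinked us from the wait list) and forward a wakeup
         * to the next walsender -- which is what chains wakeups into a storm.
         */
        ConditionVariableCancelSleep();
    }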
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From
Andres Freund
Date:
Hi,

On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
> That's an awful lot of CPU for a cluster doing essentially "nothing"
> (there's no WAL to decode/replicate, etc.). It usually disappears after
> a couple seconds, but sometimes it's a rather persistent state.

Ugh, that's not great.

> The profile from the walsender processes looks like this:
>
>     --99.94%--XLogSendLogical
>               |
>               |--99.23%--XLogReadRecord
>               |          XLogReadAhead
>               |          XLogDecodeNextRecord
>               |          ReadPageInternal
>               |          logical_read_xlog_page
>               |          |
>               |          |--97.80%--WalSndWaitForWal
>               |          |          |
>               |          |          |--68.48%--WalSndWait
>
> It seems to me the issue is in WalSndWait, which was reworked to use
> ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
> waking each other in a busy loop, until the timing changes just enough
> to break the cycle.

IMO ConditionVariableCancelSleep()'s behaviour of waking up additional processes can nearly be considered a bug, at least when combined with ConditionVariableBroadcast(). In that case the wakeup is never needed, and it can cause situations like this, where condition variables basically deteriorate to a busy loop.

I hit this with AIO as well. I've been "solving" it by adding a ConditionVariableCancelSleepEx(), which has an only_broadcasts argument.

I'm inclined to think that any code that needs the forwarding behaviour is pretty much buggy.

Greetings,

Andres Freund
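[Editorial note: ConditionVariableCancelSleepEx() is a private workaround from Andres's AIO branch and is not in-tree; the sketch below is a hypothetical reconstruction of the shape he describes, written against the pre-fix condition_variable.c logic. Everything here is an assumption except the function name and the only_broadcasts argument, which come from his mail.]

    /*
     * Hypothetical ConditionVariableCancelSleepEx() -- NOT the actual patch.
     * only_broadcasts lets callers woken exclusively via
     * ConditionVariableBroadcast() suppress the signal forwarding.
     */
    #include "postgres.h"
    #include "storage/condition_variable.h"
    #include "storage/proc.h"
    #include "storage/proclist.h"
    #include "storage/spin.h"

    /* In the real file this is condition_variable.c's module-level state. */
    static ConditionVariable *cv_sleep_target = NULL;

    void
    ConditionVariableCancelSleepEx(bool only_broadcasts)
    {
        ConditionVariable *cv = cv_sleep_target;
        bool        signaled = false;

        if (cv == NULL)
            return;

        SpinLockAcquire(&cv->mutex);
        if (proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
            proclist_delete(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
        else
            signaled = true;    /* already unlinked: we consumed a wakeup */
        SpinLockRelease(&cv->mutex);

        /*
         * Forward the consumed wakeup only if the caller might have eaten a
         * targeted ConditionVariableSignal(); broadcasts wake everyone, so
         * forwarding there is pure overhead (and fuels the busy loop).
         */
        if (signaled && !only_broadcasts)
            ConditionVariableSignal(cv);

        cv_sleep_target = NULL;
    }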
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From
Thomas Munro
Date:
On Sat, Aug 12, 2023 at 5:51 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
> > It seems to me the issue is in WalSndWait, which was reworked to use
> > ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
> > waking each other in a busy loop, until the timing changes just enough
> > to break the cycle.
>
> IMO ConditionVariableCancelSleep()'s behaviour of waking up additional
> processes can nearly be considered a bug, at least when combined with
> ConditionVariableBroadcast(). In that case the wakeup is never needed, and it
> can cause situations like this, where condition variables basically
> deteriorate to a busy loop.
>
> I hit this with AIO as well. I've been "solving" it by adding a
> ConditionVariableCancelSleepEx(), which has an only_broadcasts argument.
>
> I'm inclined to think that any code that needs the forwarding
> behaviour is pretty much buggy.

Oh, I see what's happening. Maybe commit b91dd9de wasn't the best idea, but bc971f4025c broke an assumption, since it doesn't use ConditionVariableSleep(). That is confusing the signal forwarding logic, which expects to find our entry in the wait list in the common case.

What do you think about this patch?
Attachment
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From
Andres Freund
Date:
Hi,

On 2023-08-12 07:51:09 +1200, Thomas Munro wrote:
> Oh, I see what's happening. Maybe commit b91dd9de wasn't the best
> idea, but bc971f4025c broke an assumption, since it doesn't use
> ConditionVariableSleep(). That is confusing the signal forwarding
> logic, which expects to find our entry in the wait list in the common
> case.

Hm, I guess I got confused by the cv code once more. I thought that ConditionVariableCancelSleep() would wake us up anyway, because once we return from ConditionVariableSleep(), we'd be off the list. But I now realize (and I think not for the first time) that ConditionVariableSleep() always puts us *back* on the list.

Leaving aside the issue in this thread, isn't always adding us back into the list bad from a contention POV alone? It doubles the write traffic on the CV and is guaranteed to cause contention for ConditionVariableBroadcast(). I wonder if this explains some performance issues I've seen in the past.

What if we instead reset cv_sleep_target once we've been taken off the list? I think it'd not be too hard to make it safe to do the proclist_contains() without the spinlock. Lwlocks have something similar; there we solve it by this sequence:

    proclist_delete(&wakeup, iter.cur, lwWaitLink);

    /*
     * Guarantee that lwWaiting being unset only becomes visible once the
     * unlink from the link has completed. Otherwise the target backend
     * could be woken up for other reason and enqueue for a new lock - if
     * that happens before the list unlink happens, the list would end up
     * being corrupted.
     *
     * The barrier pairs with the LWLockWaitListLock() when enqueuing for
     * another lock.
     */
    pg_write_barrier();
    waiter->lwWaiting = LW_WS_NOT_WAITING;
    PGSemaphoreUnlock(waiter->sem);

I guess this means we'd need something like lwWaiting for CVs as well.

> From a85b2827f4500bc2e7c533feace474bc47086dfa Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Sat, 12 Aug 2023 07:06:08 +1200
> Subject: [PATCH] De-pessimize ConditionVariableCancelSleep().
>
> Commit b91dd9de was concerned with a theoretical problem with our
> non-atomic condition variable operations. If you stop sleeping, and
> then cancel the sleep in a separate step, you might be signaled in
> between, and that could be lost. That doesn't matter for callers of
> ConditionVariableBroadcast(), but callers of ConditionVariableSignal()
> might be upset if a signal went missing like this.

FWIW I suspect at least some of the places that'd have a problem without that forwarding might also be racy with it....

> New idea: ConditionVariableCancelSleep() can just return true if you've
> been signaled. Hypothetical users of ConditionVariableSignal() would
> then still have a way to deal with rare lost signals if they are
> concerned about that problem.

Sounds like a plan to me.

Greetings,

Andres Freund
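[Editorial note: the patch itself isn't reproduced in the archive, but the quoted commit message pins down its shape: drop the forwarding and instead report whether a signal was consumed. A minimal sketch of that shape, paraphrased from memory of what became commit 5ffb7c7750 (details may differ from the committed code; cv_sleep_target, cv->mutex, cv->wakeup, and cvWaitLink are condition_variable.c internals):]

    /*
     * Post-fix ConditionVariableCancelSleep(), sketched: same unlink logic
     * as before, but the consumed-signal case is returned to the caller
     * instead of being forwarded to another waiter.
     */
    bool
    ConditionVariableCancelSleep(void)
    {
        ConditionVariable *cv = cv_sleep_target;
        bool        signaled = false;

        if (cv == NULL)
            return false;

        SpinLockAcquire(&cv->mutex);
        if (proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
            proclist_delete(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
        else
            signaled = true;    /* someone woke us between wait and cancel */
        SpinLockRelease(&cv->mutex);

        cv_sleep_target = NULL;

        /* Caller may re-signal if it cares about rare lost signals. */
        return signaled;
    }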
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From
Tomas Vondra
Date:
On 8/11/23 21:51, Thomas Munro wrote:
> On Sat, Aug 12, 2023 at 5:51 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
>>> It seems to me the issue is in WalSndWait, which was reworked to use
>>> ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
>>> waking each other in a busy loop, until the timing changes just enough
>>> to break the cycle.
>>
>> IMO ConditionVariableCancelSleep()'s behaviour of waking up additional
>> processes can nearly be considered a bug, at least when combined with
>> ConditionVariableBroadcast(). In that case the wakeup is never needed, and it
>> can cause situations like this, where condition variables basically
>> deteriorate to a busy loop.
>>
>> I hit this with AIO as well. I've been "solving" it by adding a
>> ConditionVariableCancelSleepEx(), which has an only_broadcasts argument.
>>
>> I'm inclined to think that any code that needs the forwarding
>> behaviour is pretty much buggy.
>
> Oh, I see what's happening. Maybe commit b91dd9de wasn't the best
> idea, but bc971f4025c broke an assumption, since it doesn't use
> ConditionVariableSleep(). That is confusing the signal forwarding
> logic, which expects to find our entry in the wait list in the common
> case.
>
> What do you think about this patch?

I'm not familiar with the condition variable code enough to have an opinion, but the patch seems to resolve the issue for me - I can no longer reproduce the high CPU usage.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From
Thomas Munro
Date:
On Tue, Aug 15, 2023 at 2:23 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I'm not familiar with the condition variable code enough to have an
> opinion, but the patch seems to resolve the issue for me - I can no
> longer reproduce the high CPU usage.

Thanks, pushed.
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From
Antonin Houska
Date:
Thomas Munro <thomas.munro@gmail.com> wrote:

> On Tue, Aug 15, 2023 at 2:23 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > I'm not familiar with the condition variable code enough to have an
> > opinion, but the patch seems to resolve the issue for me - I can no
> > longer reproduce the high CPU usage.
>
> Thanks, pushed.

I'm trying to understand this patch (commit 5ffb7c7750) because I use a condition variable in an extension. One particular problem occurred to me, please consider:

ConditionVariableSleep() gets interrupted, so AbortTransaction() calls ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't at least AbortTransaction() and AbortSubTransaction() check the return value of ConditionVariableCancelSleep(), and re-send the signal if needed?

Note that I'm just thinking about such a problem; I did not try to reproduce it.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
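[Editorial note: nothing like the following exists in-tree; it is only a hypothetical sketch of the check Antonin proposes, to make the concern concrete. The abort path would somehow need to know which CV was involved, which is itself an assumption -- cv_we_were_waiting_on is an invented name.]

    /*
     * Hypothetical abort-path handling per Antonin's suggestion: if
     * ConditionVariableCancelSleep() reports that we consumed a signal
     * while aborting, pass it on so a targeted wakeup isn't lost.
     */
    static ConditionVariable *cv_we_were_waiting_on = NULL;  /* caller-tracked */

    static void
    cancel_cv_sleep_on_abort(void)
    {
        if (ConditionVariableCancelSleep() && cv_we_were_waiting_on != NULL)
        {
            /* We ate a ConditionVariableSignal() during abort; re-send it. */
            ConditionVariableSignal(cv_we_were_waiting_on);
        }
        cv_we_were_waiting_on = NULL;
    }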
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From
Thomas Munro
Date:
On Wed, Aug 16, 2023 at 11:18 PM Antonin Houska <ah@cybertec.at> wrote:
> I'm trying to understand this patch (commit 5ffb7c7750) because I use a
> condition variable in an extension. One particular problem occurred to me,
> please consider:
>
> ConditionVariableSleep() gets interrupted, so AbortTransaction() calls
> ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't
> at least AbortTransaction() and AbortSubTransaction() check the return value
> of ConditionVariableCancelSleep(), and re-send the signal if needed?

I wondered about that in the context of our only in-tree user of ConditionVariableSignal(), in parallel btree index creation, but since it's using the parallel executor infrastructure, any error would be propagated everywhere so all waits would be aborted.

There is another place where the infrastructure cancels for you and would now eat the result: if you prepare to sleep on one CV, and then prepare to sleep on another, we'll just cancel the first one. I think that's semantically OK: we can't really wait for two CVs at once, and if you try you'll miss signals anyway, but you must already have code to cope with that by re-checking your exit conditions.

> Note that I'm just thinking about such a problem; I did not try to reproduce it.

Hmm. I looked for users of ConditionVariableSignal() in the usual web tools and didn't find anything, so I guess your extension is not released yet or not open source. I'm curious: what does it actually do if there is an error in a CV-wakeup-consuming backend? I guess it might be some kind of work-queue processing system... it seems inevitable that if backends are failing with errors, and you don't respond by retrying/respawning, you'll lose or significantly delay jobs/events/something anyway (imagine only slightly different timing: you consume the signal and start working on a job and then ereport, which amounts to the same thing in the end now that your transaction is rolled back?), and when you retry you'll see whatever condition was waited for anyway. But that's just me imagining what some hypothetical strawman system might look like... what does it really do?

(FWIW when I worked on a couple of different work queue-like systems and tried to use ConditionVariableSignal() I eventually concluded that it was the wrong tool for the job because its wakeup order is undefined. It's actually FIFO, but I wanted LIFO so that workers have a chance to become idle and reduce the pool size, but I started to think that once you want that level of control you really want to build a bespoke wait list system, so I never got around to proposing that we consider changing that.)

Our condition variables are weird. They're not associated with a lock, so we made start-of-wait non-atomic: prepare first, then return control and let the caller check its condition, then sleep. Typical user space condition variable APIs force you to acquire some kind of lock that protects the condition first, then check the condition, then atomically release-associated-lock-and-start-sleeping, so there is no data race but also no time where control is returned to the caller but the thread is on the wait list consuming signals. That choice has some pros (you can choose whatever type of lock you want to protect your condition, or none at all if you can get away with memory barriers and magic) and cons. However, as I think Andres was getting at, having a non-atomic start-of-wait doesn't seem to require us to have a non-atomic end-of-wait and associated extra contention. So maybe we should figure out how to fix that in 17.
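[Editorial note: the contrast Munro draws can be made concrete with a short sketch. Part (a) uses the standard POSIX API; part (b) shows the PostgreSQL idiom as documented in condition_variable.h. The helper condition_holds() and the use of PG_WAIT_EXTENSION as the wait-event value are illustrative assumptions.]

    /*
     * (a) POSIX style: the mutex protects the condition, and
     * pthread_cond_wait() atomically releases it and starts sleeping, so
     * there is no window where the thread runs while on the wait list.
     */
    #include <pthread.h>

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static int ready = 0;

    static void
    posix_style_wait(void)
    {
        pthread_mutex_lock(&mtx);
        while (!ready)
            pthread_cond_wait(&cond, &mtx);   /* atomic unlock-and-sleep */
        pthread_mutex_unlock(&mtx);
    }

    /*
     * (b) PostgreSQL style: prepare (join the wait list), then re-check the
     * condition under whatever synchronization the caller chose, then sleep.
     * Between prepare and sleep, control returns to the caller while the
     * process can already consume signals -- the non-atomic start-of-wait.
     */
    static void
    postgres_style_wait(ConditionVariable *cv)
    {
        ConditionVariablePrepareToSleep(cv);
        while (!condition_holds())            /* caller-provided check */
            ConditionVariableSleep(cv, PG_WAIT_EXTENSION);
        ConditionVariableCancelSleep();
    }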
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From
Antonin Houska
Date:
Thomas Munro <thomas.munro@gmail.com> wrote:

> On Wed, Aug 16, 2023 at 11:18 PM Antonin Houska <ah@cybertec.at> wrote:
> > I'm trying to understand this patch (commit 5ffb7c7750) because I use a
> > condition variable in an extension. One particular problem occurred to me,
> > please consider:
> >
> > ConditionVariableSleep() gets interrupted, so AbortTransaction() calls
> > ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't
> > at least AbortTransaction() and AbortSubTransaction() check the return value
> > of ConditionVariableCancelSleep(), and re-send the signal if needed?
>
> I wondered about that in the context of our only in-tree user of
> ConditionVariableSignal(), in parallel btree index creation, but since
> it's using the parallel executor infrastructure, any error would be
> propagated everywhere so all waits would be aborted.

I see, ConditionVariableSignal() is currently used only to signal other workers running in the same transaction. The other parts use ConditionVariableBroadcast(), so no consumer should miss its signal.

> > Note that I'm just thinking about such a problem; I did not try to reproduce it.
>
> Hmm. I looked for users of ConditionVariableSignal() in the usual web
> tools and didn't find anything, so I guess your extension is not
> released yet or not open source. I'm curious: what does it actually
> do if there is an error in a CV-wakeup-consuming backend? I guess it
> might be some kind of work-queue processing system... it seems
> inevitable that if backends are failing with errors, and you don't
> respond by retrying/respawning, you'll lose or significantly delay
> jobs/events/something anyway (imagine only slightly different timing:
> you consume the signal and start working on a job and then ereport,
> which amounts to the same thing in the end now that your transaction
> is rolled back?), and when you retry you'll see whatever condition was
> waited for anyway. But that's just me imagining what some
> hypothetical strawman system might look like... what does it really
> do?

If you're interested, the extension is pg_squeeze [1]. I think the use case is rather special. All the work is done by a background worker, but a user function can be called to submit a "task" for the worker and wait for its completion. So the function sleeps on a CV and the worker uses the CV to wake it up. If this function ends due to ERROR, the user is supposed to find a log message in the worker output sooner or later. It may sound weird, but that function exists primarily for regression tests, so ERROR is a problem anyway.

> (FWIW when I worked on a couple of different work queue-like systems
> and tried to use ConditionVariableSignal() I eventually concluded that
> it was the wrong tool for the job because its wakeup order is
> undefined. It's actually FIFO, but I wanted LIFO so that workers have
> a chance to become idle and reduce the pool size, but I started to
> think that once you want that level of control you really want to
> build a bespoke wait list system, so I never got around to proposing
> that we consider changing that.)
>
> Our condition variables are weird. They're not associated with a
> lock, so we made start-of-wait non-atomic: prepare first, then return
> control and let the caller check its condition, then sleep. Typical
> user space condition variable APIs force you to acquire some kind of
> lock that protects the condition first, then check the condition, then
> atomically release-associated-lock-and-start-sleeping, so there is no
> data race but also no time where control is returned to the caller but
> the thread is on the wait list consuming signals. That choice has
> some pros (you can choose whatever type of lock you want to protect
> your condition, or none at all if you can get away with memory
> barriers and magic) and cons. However, as I think Andres was getting
> at, having a non-atomic start-of-wait doesn't seem to require us to
> have a non-atomic end-of-wait and associated extra contention. So
> maybe we should figure out how to fix that in 17.

Thanks for sharing your point of view. I'm fine with this low-level approach: it's well documented and there are examples in the PG code showing how it should be used :-)

[1] https://github.com/cybertec-postgresql/pg_squeeze/

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
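[Editorial note: the submit-and-wait pattern Antonin describes for pg_squeeze might look roughly like the sketch below. All names and the struct layout are invented for illustration; this is not pg_squeeze code, and a real implementation would need proper shared-memory setup and memory-ordering care around the done flag.]

    /*
     * Hypothetical shape of "user backend submits a task, sleeps on a CV;
     * background worker completes it and wakes the submitter".
     */
    typedef struct SqueezeTask
    {
        ConditionVariable cv;     /* worker broadcasts on completion */
        bool        done;         /* set by the worker when finished */
    } SqueezeTask;

    /* User-called function: wait until the worker marks the task done. */
    static void
    wait_for_task(SqueezeTask *task)
    {
        ConditionVariablePrepareToSleep(&task->cv);
        while (!task->done)
            ConditionVariableSleep(&task->cv, PG_WAIT_EXTENSION);
        ConditionVariableCancelSleep();
    }

    /* Worker side, after finishing the task. */
    static void
    complete_task(SqueezeTask *task)
    {
        task->done = true;
        ConditionVariableBroadcast(&task->cv);
    }

If the waiting function errors out (the case discussed above), the abort path's ConditionVariableCancelSleep() now simply reports a consumed signal instead of forwarding it; with broadcast-based completion as sketched here, no other consumer can miss its wakeup.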