Thread: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)

Hi,

while working on some logical replication stuff, I noticed that on PG16
I often end up with a completely idle publisher (no user activity), that
however looks like this in top:

 ...  %CPU  COMMAND
 ...  17.9  postgres: walsender user test ::1(43064) START_REPLICATION
 ...  16.6  postgres: walsender user test ::1(43128) START_REPLICATION
 ...  15.0  postgres: walsender user test ::1(43202) START_REPLICATION
 ...   6.6  postgres: walsender user test ::1(43236) START_REPLICATION
 ...   5.3  postgres: walsender user test ::1(43086) START_REPLICATION
 ...   4.3  postgres: walsender user test ::1(43180) START_REPLICATION
 ...   3.7  postgres: walsender user test ::1(43052) START_REPLICATION
 ...   3.7  postgres: walsender user test ::1(43158) START_REPLICATION
 ...   3.0  postgres: walsender user test ::1(43108) START_REPLICATION
 ...   3.0  postgres: walsender user test ::1(43214) START_REPLICATION

That's an awful lot of CPU for a cluster doing essentially "nothing"
(there's no WAL to decode/replicate, etc.). It usually disappears after
a couple seconds, but sometimes it's a rather persistent state.

The profile from the walsender processes looks like this:

    --99.94%--XLogSendLogical
              |
              |--99.23%--XLogReadRecord
              |          XLogReadAhead
              |          XLogDecodeNextRecord
              |          ReadPageInternal
              |          logical_read_xlog_page
              |          |
              |          |--97.80%--WalSndWaitForWal
              |          |          |
              |          |          |--68.48%--WalSndWait

It seems to me the issue is in WalSndWait, which was reworked to use
ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
waking each other in a busy loop, until the timing changes just enough
to break the cycle.

I've been unable to reproduce this on PG15, and bc971f4025c seems like
the most significant change to WalSndWait, which is why I suspect it's
related to the issue.

Reproducing this is simple: create a publisher with multiple subscriptions
(they can even go to the same subscriber instance) and an empty publication.
Triggering a "noop" decode is then likely to cause this high CPU usage:

---------------------------------------------------------------------
# init two clusters
pg_ctl -D data-publisher init
pg_ctl -D data-subscriber init

# configure the publisher for logical decoding, and the subscriber
# to allow 10 subscriptions
echo 'wal_level = logical' >> data-publisher/postgresql.conf
echo 'port = 5433' >> data-subscriber/postgresql.conf
echo 'max_worker_processes = 20' >> data-subscriber/postgresql.conf
echo 'max_logical_replication_workers = 20' >> data-subscriber/postgresql.conf

# start both clusters
pg_ctl -D data-publisher -l publisher.log start
pg_ctl -D data-subscriber -l subscriber.log start

# setup an empty publication on the publisher (port 5432)
createdb test
psql test -c "create publication p";

# setup 10 subscriptions on the subscriber (port 5433)
for i in `seq 1 10`; do
  createdb -p 5433 test$i
  psql -p 5433 test$i -c "create subscription s$i CONNECTION 'host=localhost port=5432 dbname=test' publication p";
done

# emit logical messages, which are almost a noop, 5s apart
for i in `seq 1 10`; do
    psql test -c "select pg_logical_emit_message(false, 'x', 'x')";
    sleep 5;
done
---------------------------------------------------------------------


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Hi,

On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
> That's an awful lot of CPU for a cluster doing essentially "nothing"
> (there's no WAL to decode/replicate, etc.). It usually disappears after
> a couple seconds, but sometimes it's a rather persistent state.

Ugh, that's not great.

> The profile from the walsender processes looks like this:
> 
>     --99.94%--XLogSendLogical
>               |
>               |--99.23%--XLogReadRecord
>               |          XLogReadAhead
>               |          XLogDecodeNextRecord
>               |          ReadPageInternal
>               |          logical_read_xlog_page
>               |          |
>               |          |--97.80%--WalSndWaitForWal
>               |          |          |
>               |          |          |--68.48%--WalSndWait
> 
> It seems to me the issue is in WalSndWait, which was reworked to use
> ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
> waking each other in a busy loop, until the timing changes just enough
> to break the cycle.

IMO ConditionVariableCancelSleep()'s behaviour of waking up additional
processes can nearly be considered a bug, at least when combined with
ConditionVariableBroadcast(). In that case the wakeup is never needed, and it
can cause situations like this, where condition variables basically
deteriorate to a busy loop.

I hit this with AIO as well. I've been "solving" it by adding a
ConditionVariableCancelSleepEx(), which has an only_broadcasts argument.

I'm inclined to think that any code that needs the forwarding
behaviour is pretty much buggy.
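
For concreteness, a rough sketch of what such a variant could look like,
written against the condvar.c internals (cv_sleep_target, the per-CV spinlock
and wakeup proclist); illustrative only, not code from any release:

    void
    ConditionVariableCancelSleepEx(bool only_broadcasts)
    {
        ConditionVariable *cv = cv_sleep_target;
        bool        signaled = false;

        if (cv == NULL)
            return;

        SpinLockAcquire(&cv->mutex);
        if (proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
            proclist_delete(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
        else
            signaled = true;    /* somebody already removed us, i.e. signaled us */
        SpinLockRelease(&cv->mutex);

        /*
         * The forwarding step under discussion: pass a consumed signal on to
         * the next waiter.  If the caller knows the CV is only ever woken via
         * ConditionVariableBroadcast(), no signal can be "lost", so the
         * forwarding is never needed and only produces spurious wakeups.
         */
        if (signaled && !only_broadcasts)
            ConditionVariableSignal(cv);

        cv_sleep_target = NULL;
    }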

Greetings,

Andres Freund



On Sat, Aug 12, 2023 at 5:51 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
> > It seems to me the issue is in WalSndWait, which was reworked to use
> > ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
> > waking each other in a busy loop, until the timing changes just enough
> > to break the cycle.
>
> IMO ConditionVariableCancelSleep()'s behaviour of waking up additional
> processes can nearly be considered a bug, at least when combined with
> ConditionVariableBroadcast(). In that case the wakeup is never needed, and it
> can cause situations like this, where condition variables basically
> deteriorate to a busy loop.
>
> I hit this with AIO as well. I've been "solving" it by adding a
> ConditionVariableCancelSleepEx(), which has an only_broadcasts argument.
>
> I'm inclined to think that any code that needs the forwarding
> behaviour is pretty much buggy.

Oh, I see what's happening.  Maybe commit b91dd9de wasn't the best
idea, but bc971f4025c broke an assumption, since it doesn't use
ConditionVariableSleep().  That is confusing the signal forwarding
logic which expects to find our entry in the wait list in the common
case.
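
To spell that out, here is a simplified sketch of the PG16 wait pattern in
question (not the verbatim walsender code, and the CV field name is from
memory; event, timeout and wait_event are WalSndWait()'s locals/arguments):

    ConditionVariablePrepareToSleep(&WalSndCtl->wal_flush_cv);

    if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
        (event.events & WL_POSTMASTER_DEATH))
        proc_exit(1);

    ConditionVariableCancelSleep();

Once a broadcast has already taken the walsender off the wait list, that
final cancel concludes it was signaled and forwards the wakeup to the next
walsender, which finds no new WAL to send, loops around, and forwards it
again.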

What do you think about this patch?

Hi,

On 2023-08-12 07:51:09 +1200, Thomas Munro wrote:
> Oh, I see what's happening.  Maybe commit b91dd9de wasn't the best
> idea, but bc971f4025c broke an assumption, since it doesn't use
> ConditionVariableSleep().  That is confusing the signal forwarding
> logic which expects to find our entry in the wait list in the common
> case.

Hm, I guess I got confused by the cv code once more. I thought that
ConditionVariableCancelSleep() would wake somebody else up anyway, because
once we return from ConditionVariableSleep(), we'd be off the list. But I now
realize (and I think not for the first time), that ConditionVariableSleep()
always puts us *back* on the list.


Leaving aside the issue in this thread, isn't always adding us back into the
list bad from a contention POV alone - it doubles the write traffic on the CV
and is guaranteed to cause contention for ConditionVariableBroadcast().  I
wonder if this explains some performance issues I've seen in the past.
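
To illustrate, the canonical usage pattern with the behaviour in question
marked up (the comments are descriptive, not quoted from the header, and the
condition test is a placeholder):

    ConditionVariablePrepareToSleep(cv);        /* initial enqueue */

    while (!condition_we_are_waiting_for())
    {
        /*
         * A signal/broadcast dequeues us and sets our latch; before
         * returning, ConditionVariableSleep() puts us *back* on the wait
         * list so no wakeup is lost while we re-test the condition.  That
         * re-enqueue is the per-iteration write traffic in question.
         */
        ConditionVariableSleep(cv, wait_event_info);
    }

    ConditionVariableCancelSleep();             /* final dequeue / cleanup */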

What if we instead reset cv_sleep_target once we've been taken off the list? I
think it'd not be too hard to make it safe to do the proclist_contains()
without the spinlock.  LWLocks have something similar; there we solve it with
this sequence:

        proclist_delete(&wakeup, iter.cur, lwWaitLink);

        /*
         * Guarantee that lwWaiting being unset only becomes visible once the
         * unlink from the link has completed. Otherwise the target backend
         * could be woken up for other reason and enqueue for a new lock - if
         * that happens before the list unlink happens, the list would end up
         * being corrupted.
         *
         * The barrier pairs with the LWLockWaitListLock() when enqueuing for
         * another lock.
         */
        pg_write_barrier();
        waiter->lwWaiting = LW_WS_NOT_WAITING;
        PGSemaphoreUnlock(waiter->sem);

I guess this means we'd need something like lwWaiting for CVs as well.
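
A hypothetical sketch of what the wakeup side could then do, mirroring the
LWLock sequence above (none of these CV/PGPROC fields exist today):

    proclist_delete(&cv->wakeup, procno, cvWaitLink);

    /*
     * As with lwWaiting: make the "no longer enqueued" state visible only
     * after the unlink has completed, so the sleeper could test a
     * hypothetical cvSleeping flag without taking the CV spinlock.
     */
    pg_write_barrier();
    waiter->cvSleeping = false;     /* hypothetical PGPROC field */
    SetLatch(&waiter->procLatch);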


> From a85b2827f4500bc2e7c533feace474bc47086dfa Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Sat, 12 Aug 2023 07:06:08 +1200
> Subject: [PATCH] De-pessimize ConditionVariableCancelSleep().
>
> Commit b91dd9de was concerned with a theoretical problem with our
> non-atomic condition variable operations.  If you stop sleeping, and
> then cancel the sleep in a separate step, you might be signaled in
> between, and that could be lost.  That doesn't matter for callers of
> ConditionVariableBroadcast(), but callers of ConditionVariableSignal()
> might be upset if a signal went missing like this.

FWIW I suspect at least some of the places that'd have a problem without that
forwarding might also be racy with it...


> New idea: ConditionVariableCancelSleep() can just return true if you've
> been signaled.  Hypothetical users of ConditionVariableSignal() would
> then still have a way to deal with rare lost signals if they are
> concerned about that problem.

Sounds like a plan to me.

Greetings,

Andres Freund




On 8/11/23 21:51, Thomas Munro wrote:
> On Sat, Aug 12, 2023 at 5:51 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
>>> It seems to me the issue is in WalSndWait, which was reworked to use
>>> ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
>>> waking each other in a busy loop, until the timing changes just enough
>>> to break the cycle.
>>
>> IMO ConditionVariableCancelSleep()'s behaviour of waking up additional
>> processes can nearly be considered a bug, at least when combined with
>> ConditionVariableBroadcast(). In that case the wakeup is never needed, and it
>> can cause situations like this, where condition variables basically
>> deteriorate to a busy loop.
>>
>> I hit this with AIO as well. I've been "solving" it by adding a
>> ConditionVariableCancelSleepEx(), which has an only_broadcasts argument.
>>
>> I'm inclined to think that any code that needs the forwarding
>> behaviour is pretty much buggy.
> 
> Oh, I see what's happening.  Maybe commit b91dd9de wasn't the best
> idea, but bc971f4025c broke an assumption, since it doesn't use
> ConditionVariableSleep().  That is confusing the signal forwarding
> logic which expects to find our entry in the wait list in the common
> case.
> 
> What do you think about this patch?

I'm not familiar with the condition variable code enough to have an
opinion, but the patch seems to resolve the issue for me - I can no
longer reproduce the high CPU usage.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Tue, Aug 15, 2023 at 2:23 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I'm not familiar with the condition variable code enough to have an
> opinion, but the patch seems to resolve the issue for me - I can no
> longer reproduce the high CPU usage.

Thanks, pushed.



Thomas Munro <thomas.munro@gmail.com> wrote:

> On Tue, Aug 15, 2023 at 2:23 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > I'm not familiar with the condition variable code enough to have an
> > opinion, but the patch seems to resolve the issue for me - I can no
> > longer reproduce the high CPU usage.
>
> Thanks, pushed.

I'm trying to understand this patch (commit 5ffb7c7750) because I use a
condition variable in an extension. One particular problem occurred to me;
please consider:

ConditionVariableSleep() gets interrupted, so AbortTransaction() calls
ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't
at least AbortTransaction() and AbortSubTransaction() check the return value
of ConditionVariableCancelSleep(), and re-send the signal if needed?
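
I.e. something roughly like this in the abort paths (illustrative only; one
complication is that by then the generic abort code doesn't know which CV was
being slept on, so it would have to be remembered somewhere):

    if (ConditionVariableCancelSleep())
        ConditionVariableSignal(cv);    /* cv: whatever we were waiting on */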

Note that I'm just thinking about such a problem, did not try to reproduce it.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



On Wed, Aug 16, 2023 at 11:18 PM Antonin Houska <ah@cybertec.at> wrote:
> I'm trying to understand this patch (commit 5ffb7c7750) because I use a
> condition variable in an extension. One particular problem occurred to me;
> please consider:
>
> ConditionVariableSleep() gets interrupted, so AbortTransaction() calls
> ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't
> at least AbortTransaction() and AbortSubTransaction() check the return value
> of ConditionVariableCancelSleep(), and re-send the signal if needed?

I wondered about that in the context of our only in-tree user of
ConditionVariableSignal(), in parallel btree index creation, but since
it's using the parallel executor infrastructure, any error would be
propagated everywhere so all waits would be aborted.  There is another
place where the infrastructure cancels for you and would now eat the
result: if you prepare to sleep on one CV, and then prepare to sleep
on another, we'll just cancel the first one.  I think that's
semantically OK: we can't really wait for two CVs at once, and if you
try you'll miss signals anyway, but you must already have code to cope
with that by re-checking your exit conditions.

> Note that I'm just thinking about such a problem, did not try to reproduce it.

Hmm.  I looked for users of ConditionVariableSignal() in the usual web
tools and didn't find anything, so I guess your extension is not
released yet or not open source.  I'm curious: what does it actually
do if there is an error in a CV-wakeup-consuming backend?  I guess it
might be some kind of work-queue processing system... it seems
inevitable that if backends are failing with errors, and you don't
respond by retrying/respawning, you'll lose or significantly delay
jobs/events/something anyway (imagine only slightly different timing:
you consume the signal and start working on a job and then ereport,
which amounts to the same thing in the end now that your transaction
is rolled back?), and when you retry you'll see whatever condition was
waited for anyway.  But that's just me imagining what some
hypothetical strawman system might look like...  what does it really
do?

(FWIW when I worked on a couple of different work queue-like systems
and tried to use ConditionVariableSignal() I eventually concluded that
it was the wrong tool for the job because its wakeup order is
undefined.  It's actually FIFO, but I wanted LIFO so that workers have
a chance to become idle and reduce the pool size, but I started to
think that once you want that level of control you really want to
build a bespoke wait list system, so I never got around to proposing
that we consider changing that.)

Our condition variables are weird.  They're not associated with a
lock, so we made start-of-wait non-atomic: prepare first, then return
control and let the caller check its condition, then sleep.  Typical
user space condition variable APIs force you to acquire some kind of
lock that protects the condition first, then check the condition, then
atomically release-associated-lock-and-start-sleeping, so there is no
data race but also no time where control is returned to the caller but
the thread is on the wait list consuming signals.  That choice has
some pros (you can choose whatever type of lock you want to protect
your condition, or none at all if you can get away with memory
barriers and magic) and cons.  However, as I think Andres was getting
at, having a non-atomic start-of-wait doesn't seem to require us to
have a non-atomic end-of-wait and associated extra contention.  So
maybe we should figure out how to fix that in 17.
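
For contrast, the typical userspace shape described above, in plain POSIX
threads (nothing PostgreSQL-specific; mtx, cond and the predicate are whatever
protects and describes your state):

    pthread_mutex_lock(&mtx);
    while (!predicate)
        pthread_cond_wait(&cond, &mtx);     /* atomically unlock + sleep */
    /* predicate is true and mtx is held again here */
    pthread_mutex_unlock(&mtx);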



Thomas Munro <thomas.munro@gmail.com> wrote:

> On Wed, Aug 16, 2023 at 11:18 PM Antonin Houska <ah@cybertec.at> wrote:
> > I'm trying to understand this patch (commit 5ffb7c7750) because I use a
> > condition variable in an extension. One particular problem occurred to me;
> > please consider:
> >
> > ConditionVariableSleep() gets interrupted, so AbortTransaction() calls
> > ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't
> > at least AbortTransaction() and AbortSubTransaction() check the return value
> > of ConditionVariableCancelSleep(), and re-send the signal if needed?
>
> I wondered about that in the context of our only in-tree user of
> ConditionVariableSignal(), in parallel btree index creation, but since
> it's using the parallel executor infrastructure, any error would be
> propagated everywhere so all waits would be aborted.

I see, ConditionVariableSignal() is currently used only to signal other
workers running in the same transaction. The other parts use
ConditionVariableBroadcast(), so no consumer should miss its signal.

> > Note that I'm just thinking about such a problem, did not try to reproduce it.
>
> Hmm.  I looked for users of ConditionVariableSignal() in the usual web
> tools and didn't find anything, so I guess your extension is not
> released yet or not open source.  I'm curious: what does it actually
> do if there is an error in a CV-wakeup-consuming backend?  I guess it
> might be some kind of work-queue processing system... it seems
> inevitable that if backends are failing with errors, and you don't
> respond by retrying/respawning, you'll lose or significantly delay
> jobs/events/something anyway (imagine only slightly different timing:
> you consume the signal and start working on a job and then ereport,
> which amounts to the same thing in the end now that your transaction
> is rolled back?), and when you retry you'll see whatever condition was
> waited for anyway.  But that's just me imagining what some
> hypothetical strawman system might look like...  what does it really
> do?

If you're interested, the extension is pg_squeeze [1]. I think the use case is
rather special. All the work is done by a background worker, but a user
function can be called to submit a "task" for the worker and wait for its
completion. So the function sleeps on a CV and the worker uses the CV to wake
it up. If this function ends due to ERROR, the user is supposed to find a log
message in the worker output sooner or later. It may sound weird, but that
function exists primarily for regression tests, so ERROR is a problem anyway.
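
The waiting side is roughly this shape (simplified, not the literal pg_squeeze
source; the task struct and state flag here are just placeholders):

    ConditionVariablePrepareToSleep(&task->cv);
    while (task->state != TASK_DONE)
        ConditionVariableSleep(&task->cv, PG_WAIT_EXTENSION);
    ConditionVariableCancelSleep();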

> (FWIW when I worked on a couple of different work queue-like systems
> and tried to use ConditionVariableSignal() I eventually concluded that
> it was the wrong tool for the job because its wakeup order is
> undefined.  It's actually FIFO, but I wanted LIFO so that workers have
> a chance to become idle and reduce the pool size, but I started to
> think that once you want that level of control you really want to
> build a bespoke wait list system, so I never got around to proposing
> that we consider changing that.)
>
> Our condition variables are weird.  They're not associated with a
> lock, so we made start-of-wait non-atomic: prepare first, then return
> control and let the caller check its condition, then sleep.  Typical
> user space condition variable APIs force you to acquire some kind of
> lock that protects the condition first, then check the condition, then
> atomically release-associated-lock-and-start-sleeping, so there is no
> data race but also no time where control is returned to the caller but
> the thread is on the wait list consuming signals.  That choice has
> some pros (you can choose whatever type of lock you want to protect
> your condition, or none at all if you can get away with memory
> barriers and magic) and cons..  However, as I think Andres was getting
> at, having a non-atomic start-of-wait doesn't seem to require us to
> have a non-atomic end-of-wait and associated extra contention.  So
> maybe we should figure out how to fix that in 17.

Thanks for sharing your point of view. I'm fine with this low-level approach:
it's well documented and there are examples in the PG code showing how it
should be used :-)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

[1] https://github.com/cybertec-postgresql/pg_squeeze/