Thread: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues

Hello,

Recently we’ve run into an interesting situation on a standby environment 
with cascading replication configured and a large number (~100) of 
walsenders. We’ve noticed very high CPU consumption in such an 
environment, with the most time-consuming operation being signal 
delivery from the startup recovery process to the walsenders via 
WalSndWakeup invocations from ApplyWalRecord in xlogrecovery.c.

The startup process on a standby notifies walsenders serving downstream 
systems using ConditionVariableBroadcast (CV), so only processes 
actually waiting on this CV need to be contacted. However, under high 
load we seem to hit a bottleneck here anyway. The current implementation 
sends a notification after processing each WAL record (i.e. on every 
invocation of ApplyWalRecord), which implies a high rate of WalSndWakeup 
invocations. At the same time, this gives each walsender only a very 
small chunk of data to process, so almost every walsender is back on the 
CV wait list before the next iteration. As a result, the wait list tends 
to stay fully packed, which further reduces the WAL apply rate on the 
standby instance.

To reproduce this behavior we can use a simple environment of three 
servers: a primary instance, an attached physical standby, and a 
downstream server with a large number of logical replication 
subscriptions. Attached is a synthetic test case (test_scenario.zip) to 
reproduce the behavior – the ‘test_prepare.sh’ script creates the 
required environment with test data, and the ‘test_execute.sh’ script 
runs ‘pgbench’ with simple updates against the primary instance to 
trigger replication to the other servers. On my machine all CPUs are 
completely saturated with just about 6 clients after 30 seconds of the 
test. Please check the environment properties at the top of these 
scripts before running them; they need to be updated to specify the 
location of the installed PG build, the target location for creating the 
database instances, and the ports to use.

Thinking about possible ways to improve this case, some sort of batching 
may be a workable option. We could postpone sending the notification 
until recovery has applied a certain number of records. This reduces the 
rate of CV notifications and also gives the receivers more data to 
process per wakeup, so they may not need to enter the CV wait state as 
often. Counting applied records is not difficult; the tricky part is 
ensuring that we do not postpone notifications for too long under low 
load.

The first possible approach to limit this delay is to ask ReadRecord to 
return control after the first iteration if there are currently no more 
incoming records. From the standby recovery perspective we are either 
reading and applying records from the upstream instance or waiting for a 
new record to become available. We could introduce an additional field 
in XLogPageReadPrivate that tells XLogPageRead to perform a non-blocking 
read on the first iteration of ReadRecord and return a null record if no 
data is currently available. In that case we could immediately notify 
the walsenders and then switch back to regular blocking reads.

However, with this approach we cannot bound the time required to apply a 
single record and thus cannot effectively bound the resulting delay. We 
can hit a record that takes a long time to apply (e.g. ‘drop database’), 
which would hold back the whole batch from being delivered to the 
walsenders. Another option is to use a timer to schedule the 
notification while we are either waiting for records or applying them. 
In that case WalSndWakeup is invoked either after applying a certain 
number of records or after a timeout has expired since the last 
notification.

While building a proof of concept for this approach we made another 
finding: it seems that the current implementation of ConditionVariable 
notification does not work correctly if a process sleeping on one CV 
broadcasts to another CV from an interrupt handler. The comments in the 
code imply that this case is supported, but in reality it can leave the 
CV in an inconsistent state and cause problems on subsequent invocations 
of CV broadcast. ConditionVariableBroadcast removes the current process 
from the wakeup list of the CV referenced by the ‘cv_sleep_target’ 
variable and then clears that variable as well. Once 
ConditionVariableTimedSleep returns from its sleep, it adds itself back 
to the wakeup list, but then returns control to the caller without 
restoring the ‘cv_sleep_target’ value. A subsequent invocation of 
ConditionVariableCancelSleep then cannot locate the correct CV and 
correctly complete the wait cycle (the current process remains on the 
list), so the next invocation of ConditionVariableBroadcast on that CV 
may fail an assertion (we do not expect to find ourselves on the list of 
processes that need to receive our signal). I think the correct fix is 
to return control to the caller right after the sleep if 
‘cv_sleep_target’ has changed. We do not need to add ourselves back to 
the wakeup list in that case: the caller will re-check its exit 
condition in its loop and then either invoke 
ConditionVariableCancelSleep or ConditionVariableTimedSleep again. Both 
functions update the CV state correctly, so we neither miss any 
notifications nor leave the CV in an improper state.

We’ve prepared two test patches on top of current master to address both 
issues:
* 0001-CV-correctly-handle-cv_sleep_target-change.patch – adjusts the CV 
logic to correctly handle the case where ‘cv_sleep_target’ changes 
during sleep.
* 0002-Implement-batching-for-cascade-replication.patch – test patch 
implementing a possible timer-based batching approach in xlogrecovery.c. 
It currently uses GUC variables to allow testing different batch sizes 
and timeout values.

With both patches applied we’ve observed a significant reduction in CPU 
consumption when running the synthetic test mentioned above. With only 
the second patch applied, the CV problem can be reproduced by building 
the server with cassert enabled and running the TAP tests in the 
‘src/test/recovery’ directory.

It would be great to hear any thoughts on these observations and the 
proposed fixes, as well as possible pitfalls of the proposed changes.

Thanks,
Alexey

Attachment
 > We’ve prepared two test patches on top of current master to address 
both issues:
 > ...
 > * 0002-Implement-batching-for-cascade-replication.patch – test patch 
implementing a possible timer-based batching approach in xlogrecovery.c. 
It currently uses GUC variables to allow testing different batch sizes 
and timeout values.

I've played with the second patch a little more and made some 
adjustments to it:
1. Set up the timer only if we have actually applied records which are 
(potentially) not yet signaled to the walsenders.
2. Notify logical walsenders without delay if the timeline has changed.

The modified patch is attached.

Thanks,
Alexey

Attachment