Thread: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues
High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues
From
Alexey Makhmutov
Date:
Hello,

Recently we’ve run into an interesting situation on a standby environment with cascade replication and a large number (~100) of configured walsenders. We noticed very high CPU consumption in this environment, with the most time-consuming operation being signal delivery from the startup recovery process to the walsenders via the WalSndWakeup invocations in ApplyWalRecord in xlogrecovery.c. The startup process notifies walsenders for downstream systems using a ConditionVariable (CV) broadcast, so only processes waiting on this CV need to be contacted; under high load, however, we seem to hit a bottleneck here anyway.

The current implementation sends a notification after processing each WAL record (i.e. during each invocation of ApplyWalRecord), which implies a high rate of WalSndWakeup invocations. At the same time, it gives each walsender only a very small chunk of data to process, so almost every walsender is back on the CV wait list by the next iteration. As a result, the wait list stays fully packed under such load, which further reduces the rate at which the standby instance can apply WAL records.

To reproduce this behavior, a simple environment with three servers is enough: a primary instance, an attached physical standby, and a downstream server with a large number of logical replication subscriptions. Attached is a synthetic test case (test_scenario.zip): the ‘test_prepare.sh’ script creates the required environment with test data, and the ‘test_execute.sh’ script runs ‘pgbench’ with simple updates against the primary instance to trigger replication to the other servers. In this test, all CPUs on my machine are completely saturated by just about 6 clients after 30 seconds. Please check the environment properties at the top of these scripts before running them, as they need to be updated to point to the installed PG build, the target location for the database instances, and the ports to use.

Thinking about possible ways to improve this case, some sort of batching may be a workable option. We could postpone the notification until recovery has applied a certain number of records. This reduces the rate of CV notifications and also gives the receivers more data to process per wakeup, so they may not need to enter the CV wait state as often. Counting applied records is not difficult; the tricky part is to ensure that we do not postpone notifications for too long under low load.

The first possible approach to limiting this delay is to ask ReadRecord to return control on the first iteration if there are currently no more incoming records. From the standby recovery perspective, we are either reading and applying records from the upstream instance or waiting for a new record to become available. We could introduce an additional field in XLogPageReadPrivate that tells XLogPageRead, on the first iteration of ReadRecord, to perform a non-blocking read and return a null message if no data is currently available. In this case we could immediately notify the senders and then switch back to regular blocking reads. However, with this approach we cannot control the time required to apply a single record and thus cannot effectively bound the resulting delay: we may hit a record that takes a long time to apply (e.g. ‘drop database’), which would hold back delivery of the whole batch to the walsenders. A minimal sketch of this batching idea follows.
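To make the idea concrete, here is a minimal self-contained sketch of count-based batching with an immediate flush once no further input is pending. All names here (after_record_applied, WAKEUP_BATCH_SIZE, the stubbed WalSndWakeup) are illustrative assumptions, not the actual patch code:

    #include <stdbool.h>

    /* Stub standing in for the real wakeup call in walsender.c. */
    static void
    WalSndWakeup(bool physical, bool logical)
    {
        /* broadcast on the walsenders' condition variables */
    }

    #define WAKEUP_BATCH_SIZE 32        /* assumed batch threshold */

    static int  pending_records = 0;    /* applied but not yet signaled */

    /*
     * Called after each applied WAL record.  Walsenders are woken either
     * once WAKEUP_BATCH_SIZE records have accumulated, or immediately
     * when the non-blocking read reports that no further record is
     * currently available.
     */
    static void
    after_record_applied(bool more_data_available, bool switchedTLI)
    {
        pending_records++;

        if (pending_records >= WAKEUP_BATCH_SIZE || !more_data_available)
        {
            /* same arguments as the existing per-record call site */
            WalSndWakeup(switchedTLI, true);
            pending_records = 0;
        }
    }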
Another possible option is to use a timer to schedule the notification while we are either waiting for records or applying them. In that case WalSndWakeup is invoked either after a certain number of applied records or after a timeout since the last notification has expired.

While preparing a proof of concept for this approach we made another finding: it seems that the current implementation of ConditionVariable notification does not work correctly if a process waiting on one CV also tries to broadcast on another CV from an interrupt handler. The comments in the code imply that such a case is supported, but in reality it can leave the CV in an inconsistent state and cause problems on subsequent invocations of CV broadcast. ConditionVariableBroadcast removes the current process from the wakeup list of the CV referenced by the ‘cv_sleep_target’ variable and then clears this variable as well. Once ConditionVariableTimedSleep returns from the sleep, it adds the process back to the wakeup list, but then returns control to the caller without restoring the ‘cv_sleep_target’ value. A subsequent invocation of ConditionVariableCancelSleep is then unable to locate the correct CV and complete the wait cycle (the current process remains on the list), so the next ConditionVariableBroadcast on this CV may fail on assertions (we do not expect to find ourselves on the list of processes that need to receive our own signal).

I think the correct approach here is simply to return control to the caller right after the wait if ‘cv_sleep_target’ has changed. We do not need to add ourselves back to the wakeup list in that case, as the caller will re-check its exit condition in its loop and then invoke either ConditionVariableCancelSleep or ConditionVariableTimedSleep again. Both functions correctly update the CV state, so we neither miss notifications nor leave the CV in an improper state (a sketch of this check is at the end of this message).

We’ve prepared two test patches on top of current master to address both issues:

* 0001-CV-correctly-handle-cv_sleep_target-change.patch – adjusts the CV logic to correctly handle a ‘cv_sleep_target’ change during sleep.

* 0002-Implement-batching-for-cascade-replication.patch – a test patch implementing a possible batching approach in xlogrecovery.c with a timer. It currently uses GUC variables to allow testing different batch sizes and timeout values.

With both patches applied we see a significant reduction in CPU consumption under the synthetic test mentioned above. With only the second patch applied, the CV problem can be reproduced by building the server with cassert enabled and running the TAP tests in the ‘src/test/recovery’ directory.

It would be great to hear any thoughts on these observations and fixing approaches, as well as possible pitfalls of the proposed changes.
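Here is a sketch of the check we have in mind inside the wait loop of ConditionVariableTimedSleep() in condition_variable.c, placed after the latch wait and before the process re-adds itself to the wakeup list. This is an illustration of the idea, not the verbatim patch:

    if (cv != cv_sleep_target)
    {
        /*
         * A broadcast sent from an interrupt handler while we slept has
         * already removed us from this CV's wakeup list and reset
         * cv_sleep_target.  Do not re-add ourselves to the list here;
         * report the wakeup to the caller instead.  The caller will
         * re-check its condition and call ConditionVariableCancelSleep()
         * or ConditionVariableTimedSleep() again, and the latter will
         * re-prepare the sleep, restoring a consistent state.
         */
        return false;       /* woken up, not timed out */
    }

Thanks,
Alexey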
Attachment
Re: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues
From
Alexey Makhmutov
Date:
> We’ve prepared two test patches on top of current master to address both issues:
> ...
> * 0002-Implement-batching-for-cascade-replication.patch – a test patch implementing a possible batching approach in xlogrecovery.c with a timer. It currently uses GUC variables to allow testing different batch sizes and timeout values.

I've played with the second patch a little more and made some adjustments to it:

1. Set up the timer only if we have actually applied records which are (potentially) not yet signaled to the walsenders.
2. Notify logical walsenders without delay if the timeline has changed.

The modified patch is attached; a rough sketch of these adjustments follows.
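For illustration, building on the sketch from the first message, here is the rough shape of these two adjustments. The helper names (arm_wakeup_timer, disarm_wakeup_timer) and the GUC-like variables (wakeup_batch_size, wakeup_timeout_ms) are assumptions, not the actual patch code:

    static int  pending_records = 0;    /* applied but not yet signaled */

    static void
    after_record_applied(bool switchedTLI)
    {
        if (switchedTLI)
        {
            /* 2. a timeline change is signaled without delay */
            WalSndWakeup(true, true);
            pending_records = 0;
            disarm_wakeup_timer();      /* hypothetical helper */
            return;
        }

        pending_records++;

        /* 1. arm the timer only once unsignaled records actually exist */
        if (pending_records == 1)
            arm_wakeup_timer(wakeup_timeout_ms);    /* hypothetical helper */

        if (pending_records >= wakeup_batch_size)
        {
            WalSndWakeup(false, true);
            pending_records = 0;
            disarm_wakeup_timer();
        }
    }

Thanks,
Alexey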