Hello
Our current setup uses logical replication to build a BI replication server
alongside our primary clusters (running PG 10.10 so far). This implies having
one logical replication slot per database.

Since the introduction of this replication, we have noticed a significantly
higher load on our primaries, with spikes in system time whenever a lot of
WAL is written (for instance when creating GIN indexes). After some analysis,
we identified two hot spots behind this. Fixing them gave us a 10-fold
improvement in decoding speed.
The first hot spot is PostmasterIsAlive. Its implementation reads from a pipe
to check whether the postmaster is still alive, which is very expensive
kernel-wise. After switching it to a much more primitive (and probably wrong)
implementation:
bool
PostmasterIsAliveInternal(void)
{
    return getppid() == PostmasterPid;
}
we stopped seeing spikes in system time.
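For reference, the implementation we replaced performs a non-blocking read()
on the postmaster death pipe on every call, i.e. one system call per
WalSndLoop iteration. A simplified sketch from memory (error handling
omitted):

bool
PostmasterIsAlive(void)
{
    char        c;
    ssize_t     rc;

    /* one read() syscall per call, just to probe the death pipe */
    rc = read(postmaster_alive_fds[POSTMASTER_FD_WATCH], &c, 1);
    if (rc < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return true;        /* nothing to read: postmaster still alive */

    /* EOF or unexpected data: assume the postmaster is gone */
    return false;
}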
But after doing that, the CPU time used by our walsenders increased. We had
hit a second hot spot, this time in XLogSendLogical: each walsender was
burning 100% of a CPU in user time for minutes. Profiling with perf showed
that all the decoders were contending on GetFlushRecPtr.
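For context, GetFlushRecPtr reads the shared flush position under the
XLogCtl->info_lck spinlock, so every call from every walsender ends up on the
same lock and the same cache line. Roughly (simplified from
src/backend/access/transam/xlog.c):

XLogRecPtr
GetFlushRecPtr(void)
{
    /* every caller takes the same spinlock in shared memory */
    SpinLockAcquire(&XLogCtl->info_lck);
    LogwrtResult = XLogCtl->LogwrtResult;
    SpinLockRelease(&XLogCtl->info_lck);

    return LogwrtResult.Flush;
}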
On PostgreSQL 12, the call to PostmasterIsAlive is no longer present in
WalSndLoop (thanks to commit cfdf4dc4f), so only the second hot spot remains,
with the same effects.
Attached to this email are two patches.
The first one, specific to PG 10 for our use case, simplifies the
PostmasterIsAlive function as described above. I don't know whether this
implementation is valid, but it was needed to uncover the performance issue
in XLogSendLogical. Would it be possible to remove the systematic call to
PostmasterIsAlive in WalSndLoop? We are not certain of the behaviour.
The second one was tested on PG 10 and PG 12 (it applies with a 48-line
offset). On PG 12 it has the same effect as on PG 10 combined with the
isAlive patch. Instead of calling GetFlushRecPtr every time, we call it only
when we notice that the value returned by the previous call has been reached.
This way, while the senders are busy decoding, they no longer fight over a
spinlock just to read the flush pointer.
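To illustrate the idea (the attached patch is the reference; the cached
variable name below is ours), the caught-up check in XLogSendLogical becomes
something like:

    /*
     * Keep the last flush position we saw and only re-read it (taking the
     * spinlock inside GetFlushRecPtr) once decoding has caught up with the
     * cached value.
     */
    static XLogRecPtr cachedFlushPtr = InvalidXLogRecPtr;

    if (logical_decoding_ctx->reader->EndRecPtr >= cachedFlushPtr)
        cachedFlushPtr = GetFlushRecPtr();

    /* Set flag if we're caught up. */
    if (logical_decoding_ctx->reader->EndRecPtr >= cachedFlushPtr)
        WalSndCaughtUp = true;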
Here are some benchmark results.
On PG 10, decoding our replication stream originally took 3m 43s, went up to
over 5 minutes after removing the first hot spot alone, and then dropped to
22 seconds once the second one was fixed as well.
On PG 12, we had to change the benchmark (GIN index creation is more
optimized there), so we cannot compare directly with the previous numbers;
we went from 15m 11s down to 59 seconds.
If needed, we can provide scripts to reproduce this situation. It is quite
simple: set up ~20 walsenders doing logical replication of database A, then
generate a lot of data in database B. The walsenders are woken up by the
activity on database B, but since they have nothing to send they keep hitting
the same locks.
Regards