Hello, hackers.
We have a production cluster with 10 hot standby servers. Each server
has 48 cores and a 762 Mbit/s network link.
We have experienced multiple temporary outages caused by the
combination of long-running transactions and hint bits.
For example, we sometimes create a new big index, which can take up to
a day. There are also tables whose indexed columns are frequently
updated (so HOT does not apply to them). Of course, after some time we
see higher CPU usage because of the huge number of dead tuples in the
indexes and the heap, but everything still works.
The real issues begin once a long-lived transaction finally finishes.
The next index and heap scans start to mark millions of tuples with the
LP_DEAD flag, which produces a flood of full-page-write (FPW) records
in the WAL. It is impossible to transfer that volume over the network
(or even write it to disk) quickly enough, so the primary server, and
with it the whole system, becomes unavailable.
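
For a rough sense of scale (my own illustrative numbers, assuming the
default 8 kB block size and full_page_writes on): if such scans dirty,
say, five million distinct pages shortly after a checkpoint, that is
about 5,000,000 * 8 kB ~ 40 GB of full page images; at 762 Mbit/s
(~95 MB/s) that alone is roughly seven minutes of a fully saturated
link.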
A graph of the primary's resource usage during a real downtime
incident is attached.
So, I was thinking about a way to avoid such downtimes. What about a
patch adding a parameter to limit the number of FPWs caused by LP_DEAD
bits per second? It is always safe to skip setting LP_DEAD and leave it
to a later scan, since hint bits are only an optimization. Such a
parameter would make it possible to spread the additional WAL traffic
over time at some configurable Mbit/s.
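
To sketch the idea (all names below are hypothetical, not existing
PostgreSQL APIs): a token bucket refilled at a configurable byte rate,
consulted before emitting a full page image for a hint-bit-only
change. When the bucket is empty, the LP_DEAD bit is simply not set,
which is safe because a later scan can set it again. A minimal
standalone sketch in C:

    #include <stdbool.h>
    #include <time.h>

    #define BLCKSZ 8192   /* PostgreSQL's default block size */

    /* Hypothetical GUC: FPW budget for hint-bit changes, bytes/sec. */
    static double hint_fpw_limit = 10.0 * 1024 * 1024;

    static double bucket;           /* bytes currently available */
    static struct timespec last;    /* time of the last refill */

    /* Returns true if one more full page image may be emitted for an
     * LP_DEAD change; false means "skip setting the bit for now". */
    static bool
    hint_fpw_permitted(void)
    {
        struct timespec now;
        double elapsed;

        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed = (now.tv_sec - last.tv_sec) +
                  (now.tv_nsec - last.tv_nsec) / 1e9;
        last = now;

        /* Refill, but cap at one second of budget so idle periods
         * cannot accumulate an unbounded burst. */
        bucket += elapsed * hint_fpw_limit;
        if (bucket > hint_fpw_limit)
            bucket = hint_fpw_limit;

        if (bucket < BLCKSZ)
            return false;           /* budget exhausted */

        bucket -= BLCKSZ;           /* account for one FPW */
        return true;
    }

In a real patch the check would presumably sit next to the code that
decides whether a hint-bit change needs a full page image
(MarkBufferDirtyHint), and the state would have to live in shared
memory, but the throttling logic itself would look roughly like this.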
Does this look worth implementing?
Thanks,
Michail.