Thread: Patch proposal - parameter to limit amount of FPW because of hint bits per second

From: Michail Nikolaev
Hello, hackers.

We have a production cluster with 10 hot standby servers. Each server
has 48 cores and a 762 Mbit/s network link.

We have experienced multiple temporary downtimes caused by long
transactions and hint bits.

For example, we create a new big index; it can sometimes take up to a
day. Also, there are some tables with frequently updated indexes (HOT
is not used for such tables). Of course, after some time we see higher
CPU usage because of the tons of “dead” tuples in the index and heap,
but everything keeps working.

The real issues come once a long-lived transaction finally finishes.
Subsequent index and heap scans start marking millions of records with
the LP_DEAD flag, and this causes a ton of FPW records in the WAL. It
is impossible to transfer such a volume over the network (or even
write it to disk) quickly enough, and the primary server, along with
the whole system, becomes unavailable.

You can see a graph of the primary server's resources during a real
downtime incident in the attachment.

So, I was thinking about a way to avoid such downtimes. What about a
patch that adds a parameter to limit the number of FPWs caused by
LP_DEAD bits per second? It is always possible to skip setting LP_DEAD
bits and leave them for a later time. Such a parameter would make it
possible to spread the additional WAL traffic over time at some
configured Mbit/s rate.

Does it look worth implementing?

Thanks,
Michail.

On Sun, Mar 20, 2022 at 12:44 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> So, I was thinking about a way to avoid such downtimes. What about a
> patch that adds a parameter to limit the number of FPWs caused by
> LP_DEAD bits per second? It is always possible to skip setting LP_DEAD
> bits and leave them for a later time. Such a parameter would make it
> possible to spread the additional WAL traffic over time at some
> configured Mbit/s rate.
>
> Does it look worth implementing?

The following approach seems like it might fix the problem in the way
that you hope for:

* Add code to _bt_killitems() that detects if it has generated an FPI,
just to set some LP_DEAD bits.

* Instead of avoiding the FPI when this happens, proactively call
_bt_simpledel_pass() just before _bt_killitems() returns. Accept the
immediate cost of setting an LP_DEAD bit, just like today, but avoid
repeated FPIs.

The idea here is to take advantage of the enhancements to LP_DEAD
index tuple deletion (or "simple deletion") in Postgres 14.
_bt_simpledel_pass() will now do a good job of deleting "extra" heap
TIDs in practice, with many workloads. So in your scenario it's likely
that the proactive index tuple deletions will be able to delete many
"extra" nearby index tuples whose TIDs point to the same heap page.

This will be useful to you because it cuts down on repeated FPIs for
the same leaf page. You still get the FPIs, but in practice you may
get far fewer of them by triggering these proactive deletions, which
can easily delete many TIDs in one batch. I think that it's better to
pursue an approach like this because it's more general.
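
Very roughly, and only as a sketch (the glue is hand-waved here:
_bt_simpledel_pass() is currently static to nbtinsert.c, its real
argument list is longer, and the read-lock-only context of
_bt_killitems() is ignored), the end of _bt_killitems() could be
shaped like this:

    if (killedsomething)
    {
        /* Would dirtying this page just for hint bits force an FPI? */
        bool        hint_fpi = XLogCheckBufferNeedsBackup(so->currPos.buf);

        opaque->btpo_flags |= BTP_HAS_GARBAGE;
        MarkBufferDirtyHint(so->currPos.buf, true);

        if (hint_fpi)
        {
            /*
             * We paid for an FPI anyway, so get as much value out of it
             * as possible: proactively run simple deletion for the
             * known-dead tuples on this leaf page now, instead of paying
             * a similar cost again later.
             */
            _bt_simpledel_pass(scan->indexRelation, so->currPos.buf,
                               scan->heapRelation, ...);
        }
    }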

It would perhaps also make sense to not set LP_DEAD bits in
_bt_killitems() when we see that doing so right now generates an FPI,
*and* we also see that existing LP_DEAD markings are enough to make
_bt_simpledel_pass() delete the index tuple that we want to mark
LP_DEAD now, anyway (because it'll definitely visit the same heap
block later on). That does mean that we pay a small cost, but at least
we won't miss out on deleting any index tuples as a result of avoiding
an FPI. This second idea is also much more general than simply
avoiding FPIs in general.
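
Again, only a sketch -- the "already covered" test is a hypothetical
helper, not an existing function -- but inside the per-item search loop
of _bt_killitems() the idea could look like:

            if (killtuple && !ItemIdIsDead(iid))
            {
                /*
                 * Hypothetical: skip setting the bit only when doing so
                 * would generate an FPI right now *and* an already-set
                 * LP_DEAD item on this page points to the same heap
                 * block, so a later _bt_simpledel_pass() will visit that
                 * heap block and can delete this index tuple anyway.
                 */
                if (XLogCheckBufferNeedsBackup(so->currPos.buf) &&
                    heap_block_covered_by_existing_lp_dead(page, offnum))
                    break;      /* found it, but skip marking it for now */

                /* found the item/all posting list items */
                ItemIdMarkDead(iid);
                killedsomething = true;
                break;          /* out of inner search loop */
            }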

-- 
Peter Geoghegan



Hello, Peter.

> * Instead of avoiding the FPI when this happens, proactively call
> _bt_simpledel_pass() just before _bt_killitems() returns. Accept the
> immediate cost of setting an LP_DEAD bit, just like today, but avoid
> repeated FPIs.

Hm, I am not sure here.
AFAIK the current implementation does not produce repeated FPIs. The
page is marked as dirty on the first bit, so other LP_DEAD bits (if
not set by a single scan) do not generate an FPI until the next
checkpoint.
Also, the issue affects GiST and HASH indexes as well as heap pages.
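
Roughly, the existing logic looks like this (paraphrased and
simplified from MarkBufferDirtyHint() / XLogSaveBufferForHint(), not
the literal source):

    /* only relevant when wal_log_hints or data checksums are enabled */
    if (XLogHintBitIsNeeded())
    {
        XLogRecPtr  lsn = BufferGetLSNAtomic(buffer);

        /*
         * Only the first hint-bit-only change since the current
         * checkpoint started produces an FPI; after that the page LSN is
         * newer than the redo pointer, and the page is just marked dirty.
         */
        if (lsn <= GetRedoRecPtr())
            XLogSaveBufferForHint(buffer, buffer_std);  /* emits the FPI */
    }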

Best regards,
Michail.



On Mon, Mar 21, 2022 at 12:58 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> Hm, I am not sure here.
> AFAIK the current implementation does not produce repeated FPIs. The
> page is marked as dirty on the first bit, so other LP_DEAD bits (if
> not set by a single scan) do not generate an FPI until the next
> checkpoint.

There is one FPI per checkpoint for any leaf page that is modified
during that checkpoint. The difference between having that happen once
or twice per leaf page and having that happen many more times per leaf
page could be very large.

Of course it's true that that might not make that much difference. Who
knows? But if you're not willing to measure it then we'll never know.
What version are you using here? How frequently were checkpoints
occurring in the period in question, and how does that compare to
normal? You didn't even include this basic information.

Many things have changed in this area already, and it's rather unclear
how much just upgrading to Postgres 14 would help. I think that it's
possible that it would help you here a great deal. I also think it's
possible that it wouldn't help at all. I don't know which it is, and I
wouldn't expect to know without careful testing -- it's too
complicated, and likely would be even if all of the information about
the application is available.

The main reason that this can be so complex is that FPIs are caused by
more frequent checkpoints, but *also* cause more frequent checkpoints
in turn. So you could have a "death spiral" with FPIs -- the effect is
nonlinear, which has the potential to lead to pathological, chaotic
behavior. The impact on response time is *also* nonlinear and chaotic,
in turn.

Sometimes it's possible to address things like this quite well with
relatively simple solutions, that at least work well in most cases --
just avoiding getting into a "death spiral" might be all it takes. As
I said, maybe that won't be possible here, but it should be carefully
considered first. Not setting LP_DEAD bits because there are currently
"too many FPIs" requires defining what that actually means, which
seems very difficult because of these nonlinear dynamics. What do you
do when there were too many FPIs for a long time, but also too much
avoiding them earlier on? It's very complicated.

That's why I'm emphasizing solutions that focus on limiting the
downside of not setting LP_DEAD bits, which is local information (not
system wide information) that is much easier to understand and target
in the implementation.

-- 
Peter Geoghegan



Hello, Peter.

Thanks for your comments.

> There is one FPI per checkpoint for any leaf page that is modified
> during that checkpoint. The difference between having that happen once
> or twice per leaf page and having that happen many more times per leaf
> page could be very large.

Yes, I am almost sure that proactively calling _bt_simpledel_pass()
will positively impact the system on many workloads. But I am also
almost sure it will not change the behavior of the incident I
mentioned, because it is not related to multiple checkpoints.

> Of course it's true that that might not make that much difference. Who
> knows? But if you're not willing to measure it then we'll never know.
> What version are you using here? How frequently were checkpoints
> occurring in the period in question, and how does that compare to
> normal? You didn't even include this basic information.

Yes, I probably should have provided more details. The downtime is
pretty short (you can see the network peak in the telemetry image from
the first letter), just 1-3 minutes. Checkpoints happen about every 30
minutes.
It is just an issue of super-high WAL traffic caused by tons of FPIs
after a long transaction commits. The issue resolves quickly on its
own, but the downtime still happens.

> Many things have changed in this area already, and it's rather unclear
> how much just upgrading to Postgres 14 would help.

The version is 11. Yes, many things have changed, but AFAIK nothing
has changed in the FPI mechanics (LP_DEAD and other hint bits,
including heap pages).

I could probably try to reproduce the issue, but I'm not sure how to
do it in a fast and reliable way (it is hard to wait a day for each
test). It may be possible with some temporary crutch in the Postgres
source (to emulate an old transaction commit somehow).

> The main reason that this can be so complex is that FPIs are caused by
> more frequent checkpoints, but *also* cause more frequent checkpoints
> in turn. So you could have a "death spiral" with FPIs -- the effect is
> nonlinear, which has the potential to lead to pathological, chaotic
> behavior. The impact on response time is *also* nonlinear and chaotic,
> in turn.

Could you please explain "death spiral" mechanics related to FPIs?

> What do you do when there were too many FPIs for a long time, but also too much
> avoiding them earlier on? It's very complicated.

Yes, it could cause at least performance degradation if FPI avoidance
is too aggressive. I am 100% sure such a setting should be disabled by
default. It is more about the physical limits of the servers;
personally, I would set it to about 75% of the available resources.

Also, checkpoints and vacuum have something in common: they are
processes that need to be done regularly (but not right now), and they
are resource-limited. Setting LP_DEAD (and other hint bits, especially
in the heap) is also something that needs to be done regularly (but
not right now), yet it is not resource-limited.

BTW, new index creation is probably something of the same nature.

Best regards,
Michail.



Hello, Peter.

>> * Add code to _bt_killitems() that detects if it has generated an FPI,
>> just to set some LP_DEAD bits.
>> * Instead of avoiding the FPI when this happens, proactively call
>> _bt_simpledel_pass() just before _bt_killitems() returns. Accept the
>> immediate cost of setting an LP_DEAD bit, just like today, but avoid
>> repeated FPIs.

> Yes, I am almost sure that proactively calling _bt_simpledel_pass()
> will positively impact the system on many workloads. But I am also
> almost sure it will not change the behavior of the incident I
> mentioned, because it is not related to multiple checkpoints.

I just realized that this seems to be a dangerous approach because of
the locking mechanism.
Currently, _bt_killitems requires only a read lock, but
_bt_simpledel_pass requires a write lock (it ends with
_bt_delitems_delete).
The lock mode would have to be upgraded in order to call
_bt_simpledel_pass, as sketched below.

Such a change may negatively affect many workloads because of the
write lock taken during scanning, and it is really hard to prove the
absence of regression (I have no idea how).
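
Just to illustrate the problem (hypothetical, and ignoring that the
page can change the moment the read lock is released), it would need
something like:

    /* _bt_killitems() currently holds only a read (shared) lock here */
    _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
    _bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_WRITE);

    /*
     * The page may have been modified while it was unlocked, so all of
     * the killed-items work would have to be re-validated before calling
     * _bt_simpledel_pass() / _bt_delitems_delete().
     */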

Thanks,
Michail.