Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second - Mailing list pgsql-hackers

From Michail Nikolaev
Subject Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second
Date
Msg-id CANtu0ogusg8cP+vCv_skiBs25hU-4kBJLqDbcRF2cpXqu9PK+w@mail.gmail.com
In response to Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second  (Michail Nikolaev <michail.nikolaev@gmail.com>)
List pgsql-hackers
Hello, Peter.

Thanks for your comments.

> There is one FPI per checkpoint for any leaf page that is modified
> during that checkpoint. The difference between having that happen once
> or twice per leaf page and having that happen many more times per leaf
> page could be very large.

Yes, I am almost sure that proactively calling _bt_simpledel_pass()
would have a positive impact on many workloads. But I am also almost
sure it would not change the behavior in the incident I mentioned,
because that incident is not related to multiple checkpoints.

> Of course it's true that that might not make that much difference. Who
> knows? But if you're not willing to measure it then we'll never know.
> What version are you using here? How frequently were checkpoints
> occurring in the period in question, and how does that compare to
> normal? You didn't even include this basic information.

Yes, I probably should have provided more details. The downtime is
pretty short (you can see the network peak in the telemetry image from
the first letter) - just 1-3 minutes. Checkpoints occur about every 30
minutes. It is simply an issue of very high WAL volume caused by a
flood of FPIs after a long transaction commits. The issue resolved
quickly on its own, but the downtime still happened.

> Many things have changed in this area already, and it's rather unclear
> how much just upgrading to Postgres 14 would help.

The version is 11. Yes, many things have changed, but AFAIK nothing has
changed in the FPI mechanics related to LP_DEAD and other hint bits,
including the heap ones.
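
(For completeness: as far as I know, hint-bit-only page changes are
WAL-logged at all - and can therefore produce an FPI on the first touch
after a checkpoint - only when data checksums or wal_log_hints are
enabled; this is easy to check:)

    -- hint-bit-only modifications are WAL-logged (and may emit a
    -- full-page image on the first touch after a checkpoint) only if
    -- one of these is enabled on the cluster:
    SHOW data_checksums;   -- set at initdb time, read-only at runtime
    SHOW wal_log_hints;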

I could probably try to reproduce the issue, but I'm not sure how to do
it in a fast and reliable way (it is hard to wait a day for each test).
It might be possible with some temporary crutch in the Postgres source
(to emulate the commit of an old transaction somehow).
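
One crutch-free approach might look roughly like this (just a sketch I
have not tested; the pgbench parameters are only an example):

    -- session 1: hold an old snapshot open "for a long time"
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT 1;   -- acquire the snapshot that holds back the xmin horizon

    -- session 2 (meanwhile): run a write-heavy workload that produces
    -- lots of dead tuples, e.g. pgbench -c 16 -T 1800; while session 1
    -- is open, those dead index entries cannot be marked LP_DEAD

    -- session 1, later:
    COMMIT;

    -- once the horizon advances, ordinary scans start setting LP_DEAD
    -- and other hint bits on many pages at once; with data checksums or
    -- wal_log_hints=on, each first touch per checkpoint produces an FPI,
    -- which should show up as a burst in pg_waldump --stats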

> The main reason that this can be so complex is that FPIs are caused by
> more frequent checkpoints, but *also* cause more frequent checkpoints
> in turn. So you could have a "death spiral" with FPIs -- the effect is
> nonlinear, which has the potential to lead to pathological, chaotic
> behavior. The impact on response time is *also* nonlinear and chaotic,
> in turn.

Could you please explain the "death spiral" mechanics as they relate to FPIs?

> What do you do when there were too many FPIs for a long time, but also too much
> avoiding them earlier on? It's very complicated.

Yes, avoiding FPIs too aggressively could cause at least some
performance degradation. I am 100% sure such a setting should be
disabled by default. It is more about the physical limits of a given
server; personally, I would set it to about 75% of the available
resources.
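
Purely for illustration (the parameter name and numbers below are made
up by me, not what the patch actually exposes), I have something in
this spirit in mind:

    -- hypothetical parameter, named here only for illustration
    ALTER SYSTEM SET hint_bit_fpw_limit_per_second = 0;     -- 0 = disabled (default)
    -- on a server whose storage is known to be saturated by such bursts,
    -- cap the rate at roughly 75% of what the disks can sustain, e.g.:
    ALTER SYSTEM SET hint_bit_fpw_limit_per_second = 7500;
    SELECT pg_reload_conf();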

Also, checkpoints and vacuum have something in common: both are
processes that need to be done regularly (but not right now), and both
can be limited in the resources they consume. Setting LP_DEAD (and
other hint bits, especially in the heap) is also something that needs
to be done regularly (but not right now), yet it has no such resource
limit.
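
For comparison, both of those already have resource-limiting knobs,
while hint bit setting has none (the parameters below are real ones,
shown only to illustrate the analogy):

    SHOW vacuum_cost_limit;              -- vacuum: work budget per delay cycle
    SHOW vacuum_cost_delay;              -- vacuum: sleep once the budget is used up
    SHOW checkpoint_completion_target;   -- checkpoints: spread writes over the interval
    -- nothing comparable exists for the WAL/FPI volume generated by
    -- setting LP_DEAD and other hint bits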

BTW, new index creation is probably something of the same nature.

Best regards,
Michail.


