On Mon, Mar 21, 2022 at 12:58 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> Hm, not sure here
> AFAIK current implementation does not produce repeated FPIs. Page is
> marked as dirty on the first bit. So, others LP_DEAD (if not set by
> single scan) do not generate FPI until checkpoint is ready.
There is one FPI per checkpoint for any leaf page that is modified
during that checkpoint. The difference between having that happen once
or twice per leaf page and having that happen many more times per leaf
page could be very large.
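To make the counting concrete, here is a toy Python sketch (not Postgres
code). It assumes full_page_writes is on and that every modification in
question is WAL-logged (for LP_DEAD hints that means data checksums or
wal_log_hints are enabled). The function name and the timestamps are
made up for illustration.

```python
from bisect import bisect_right

def count_fpis(modification_times, checkpoint_times):
    """Count FPIs for a single page under a simplified model.

    Assumption: the first modification of the page after each checkpoint
    emits one FPI, and later modifications in the same checkpoint
    interval emit none. checkpoint_times must be sorted ascending.
    """
    # Map each modification to the index of the checkpoint interval it
    # falls in; each distinct interval costs exactly one FPI.
    intervals = {bisect_right(checkpoint_times, t) for t in modification_times}
    return len(intervals)

# Hypothetical modification times (seconds) for one leaf page.
mods = [1, 2, 3, 11, 12, 25]

infrequent = count_fpis(mods, [10, 20])
frequent = count_fpis(mods, [2.5, 5, 10, 11.5, 20])
print(infrequent, frequent)
```

The same six modifications cost more FPIs when the checkpoints are
packed more tightly, which is the difference I'm describing.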
Of course, it's true that this might not make much of a difference. Who
knows? But if you're not willing to measure it then we'll never know.
What version are you using here? How frequently were checkpoints
occurring in the period in question, and how does that compare to
normal? You didn't even include this basic information.
Many things have changed in this area already, and it's rather unclear
how much just upgrading to Postgres 14 would help. I think that it's
possible that it would help you here a great deal. I also think it's
possible that it wouldn't help at all. I don't know which it is, and I
wouldn't expect to know without careful testing -- it's too
complicated, and likely would be even if all of the information about
the application were available.
The main reason that this can be so complex is that FPIs are caused by
more frequent checkpoints, but *also* cause more frequent checkpoints
in turn. So you could have a "death spiral" with FPIs -- the effect is
nonlinear, which has the potential to lead to pathological, chaotic
behavior. The impact on response time is *also* nonlinear and chaotic,
in turn.
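A rough way to see the feedback is a toy model in Python. Everything in
it is an assumption made for illustration: checkpoints are triggered
purely by WAL volume (in the spirit of max_wal_size), modifications are
spread uniformly over a fixed set of hot pages, and the record size,
page size, and workload numbers are invented. It is a sketch of the
dynamics, not a model of any real system.

```python
import math

def seconds_until_checkpoint(max_wal_bytes, mods_per_sec, hot_pages,
                             rec_bytes=100, page_bytes=8192, fpis=True):
    """Seconds until WAL written since the last checkpoint hits the limit.

    Toy assumptions: each modification writes rec_bytes of WAL, and the
    first modification of a page since the checkpoint adds a page_bytes
    FPI. Expected values are used, so the result is deterministic.
    """
    wal = 0.0
    touched = 0.0  # expected distinct pages modified since the checkpoint
    seconds = 0
    while wal < max_wal_bytes:
        seconds += 1
        # Expected number of pages modified for the first time this second.
        newly_touched = (hot_pages - touched) * (
            1.0 - math.exp(-mods_per_sec / hot_pages))
        touched += newly_touched
        wal += mods_per_sec * rec_bytes
        if fpis:
            wal += newly_touched * page_bytes
    return seconds

GIB = 1024 ** 3
workload = dict(mods_per_sec=5000, hot_pages=100_000)  # hypothetical

without_fpis = seconds_until_checkpoint(GIB, fpis=False, **workload)
with_fpis = seconds_until_checkpoint(GIB, **workload)
half_limit = seconds_until_checkpoint(GIB // 2, **workload)
print(without_fpis, with_fpis, half_limit)
```

In this model FPIs shorten the interval between checkpoints, and
halving the WAL limit shrinks the interval by far more than half,
because a short interval is dominated by FPIs. That is the nonlinear
part.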
Sometimes it's possible to address things like this quite well with
relatively simple solutions that at least work well in most cases --
just avoiding getting into a "death spiral" might be all it takes. As
I said, maybe that won't be possible here, but it should be carefully
considered first. Not setting LP_DEAD bits because there are currently
"too many FPIs" requires defining what that actually means, which
seems very difficult because of these nonlinear dynamics. What do you
do when there were too many FPIs for a long time, but also too much
effort spent avoiding them earlier on? It's very complicated.
That's why I'm emphasizing solutions that focus on limiting the
downside of not setting LP_DEAD bits, which is local information (not
system-wide information) that is much easier to understand and target
in the implementation.
--
Peter Geoghegan