Thread: Patch proposal - parameter to limit amount of FPW because of hint bits per second
From
Michail Nikolaev
Date:
Hello, hackers.

We have a production cluster with 10 hot standby servers. Each server has 48 cores and a 762 Mbit/s network link. We have experienced multiple temporary downtimes caused by long transactions and hint bits.

For example, we create a new big index, which can sometimes take a whole day. There are also tables with frequently updated indexes (HOT does not apply to such tables). Of course, after some time we see higher CPU usage because of the tons of "dead" tuples in the index and heap, but everything still works.

The real issue comes once a long-lived transaction finally finishes. Subsequent index and heap scans start to mark millions of records with the LP_DEAD flag, and that causes a ton of FPW records in WAL. It is impossible to quickly transfer such a volume through the network (or even write it to disk), so the primary server, and with it the whole system, becomes unavailable. You can see a graph of the primary's resources during a real downtime incident in the attachment.

So, I was thinking about a way to avoid such downtimes. What about a patch adding a parameter to limit the number of FPWs caused by LP_DEAD bits per second? It is always safe to skip setting LP_DEAD bits and leave them for a later scan. Such a parameter would make it possible to spread the additional WAL traffic over time at some configured Mbit/s.

Does it look worth implementing?

Thanks,
Michail.
Attachment
Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second
From
Peter Geoghegan
Date:
On Sun, Mar 20, 2022 at 12:44 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
> So, I was thinking about a way to avoid such downtimes. What is about
> a patch to add parameters to limit the number of FPW caused by LP_DEAD
> bits per second? It is always possible to skip the setting of LP_DEAD
> for future time. Such a parameter will make it possible to spread all
> additional WAL traffic over time by some Mbit/s.
>
> Does it look worth its implementation?

The following approach seems like it might fix the problem in the way that you hope for:

* Add code to _bt_killitems() that detects if it has generated an FPI, just to set some LP_DEAD bits.

* Instead of avoiding the FPI when this happens, proactively call _bt_simpledel_pass() just before _bt_killitems() returns. Accept the immediate cost of setting an LP_DEAD bit, just like today, but avoid repeated FPIs.

The idea here is to take advantage of the enhancements to LP_DEAD index tuple deletion (or "simple deletion") in Postgres 14. _bt_simpledel_pass() will now do a good job of deleting "extra" heap TIDs in practice, with many workloads. So in your scenario it's likely that the proactive index tuple deletions will be able to delete many "extra" nearby index tuples whose TIDs point to the same heap page. This will be useful to you because it cuts down on repeated FPIs for the same leaf page. You still get the FPIs, but in practice you may get far fewer of them by triggering these proactive deletions, which can easily delete many TIDs in batch. I think that it's better to pursue an approach like this because it's more general.

It would perhaps also make sense to not set LP_DEAD bits in _bt_killitems() when we see that doing so right now would generate an FPI, *and* we also see that existing LP_DEAD markings are enough to make _bt_simpledel_pass() delete the index tuple that we want to mark LP_DEAD now anyway (because it'll definitely visit the same heap block later on). That does mean that we pay a small cost, but at least we won't miss out on deleting any index tuples as a result of avoiding an FPI. This second idea is also much more general than simply avoiding FPIs across the board.

--
Peter Geoghegan
Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second
From
Michail Nikolaev
Date:
Hello, Peter.

> * Instead of avoiding the FPI when this happens, proactively call
> _bt_simpledel_pass() just before _bt_killitems() returns. Accept the
> immediate cost of setting an LP_DEAD bit, just like today, but avoid
> repeated FPIs.

Hm, I am not sure about this. AFAIK the current implementation does not produce repeated FPIs. The page is marked dirty when the first bit is set, so other LP_DEAD bits (if not set by a single scan) do not generate an FPI until the next checkpoint.

Also, the issue affects GiST and hash indexes, and heap pages as well.

Best regards,
Michail.
Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second
From
Peter Geoghegan
Date:
On Mon, Mar 21, 2022 at 12:58 AM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
> Hm, not sure here
> AFAIK current implementation does not produce repeated FPIs. Page is
> marked as dirty on the first bit. So, others LP_DEAD (if not set by
> single scan) do not generate FPI until checkpoint is ready.

There is one FPI per checkpoint for any leaf page that is modified during that checkpoint. The difference between having that happen once or twice per leaf page and having that happen many more times per leaf page could be very large.

Of course it's true that that might not make that much difference. Who knows? But if you're not willing to measure it then we'll never know. What version are you using here? How frequently were checkpoints occurring in the period in question, and how does that compare to normal? You didn't even include this basic information.

Many things have changed in this area already, and it's rather unclear how much just upgrading to Postgres 14 would help. I think that it's possible that it would help you here a great deal. I also think it's possible that it wouldn't help at all. I don't know which it is, and I wouldn't expect to know without careful testing -- it's too complicated, and likely would be even if all of the information about the application is available.

The main reason that this can be so complex is that FPIs are caused by more frequent checkpoints, but *also* cause more frequent checkpoints in turn. So you could have a "death spiral" with FPIs -- the effect is nonlinear, which has the potential to lead to pathological, chaotic behavior. The impact on response time is *also* nonlinear and chaotic, in turn.

Sometimes it's possible to address things like this quite well with relatively simple solutions that at least work well in most cases -- just avoiding getting into a "death spiral" might be all it takes. As I said, maybe that won't be possible here, but it should be carefully considered first.

Not setting LP_DEAD bits because there are currently "too many FPIs" requires defining what that actually means, which seems very difficult because of these nonlinear dynamics. What do you do when there were too many FPIs for a long time, but also too much avoiding them earlier on? It's very complicated.

That's why I'm emphasizing solutions that focus on limiting the downside of not setting LP_DEAD bits, which is local information (not system-wide information) that is much easier to understand and target in the implementation.

--
Peter Geoghegan
Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second
From
Michail Nikolaev
Date:
Hello, Peter.

Thanks for your comments.

> There is one FPI per checkpoint for any leaf page that is modified
> during that checkpoint. The difference between having that happen once
> or twice per leaf page and having that happen many more times per leaf
> page could be very large.

Yes, I am almost sure that proactively calling _bt_simpledel_pass() will positively impact the system on many workloads. But I am also almost sure it will not change the behavior of the incident I mentioned, because that incident is not related to multiple checkpoints.

> Of course it's true that that might not make that much difference. Who
> knows? But if you're not willing to measure it then we'll never know.
> What version are you using here? How frequently were checkpoints
> occurring in the period in question, and how does that compare to
> normal? You didn't even include this basic information.

Yes, I probably should have provided more details. The downtime is pretty short (you can see the network peak in the telemetry image from the first letter) - just 1-3 minutes. Checkpoints happen about every 30 minutes. It is simply an issue of super-high WAL traffic caused by tons of FPI traffic after a long transaction commits. The issue resolves quickly on its own, but the downtime still happens.

> Many things have changed in this area already, and it's rather unclear
> how much just upgrading to Postgres 14 would help.

The version is 11. Yes, many things have changed, but AFAIK nothing has changed in the FPI mechanics (LP_DEAD and other hint bits, including heap ones). I could probably try to reproduce the issue, but I'm not sure how to do it in a fast and reliable way (it is hard to wait a day for each test). It might be possible with some temporary crutch in the Postgres source (to emulate an old transaction commit somehow).

> The main reason that this can be so complex is that FPIs are caused by
> more frequent checkpoints, but *also* cause more frequent checkpoints
> in turn. So you could have a "death spiral" with FPIs -- the effect is
> nonlinear, which has the potential to lead to pathological, chaotic
> behavior. The impact on response time is *also* nonlinear and chaotic,
> in turn.

Could you please explain the "death spiral" mechanics related to FPIs?

> What do you do when there were too many FPIs for a long time, but also too much
> avoiding them earlier on? It's very complicated.

Yes, avoiding FPIs too aggressively could cause at least performance degradation. I am 100% sure such a setting should be disabled by default. It is more about the physical limits of the servers; personally, I would set it to about 75% of available resources.

Also, there is something in common between checkpoints and vacuum: both are processes that need to be done regularly (but not right now), and both are resource-limited. Setting LP_DEAD (and other hint bits, especially in the heap) is also something that needs to be done regularly (but not right now) - yet it is not resource-limited at all. BTW, new index creation is probably something of the same nature.

Best regards,
Michail.
Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second
From
Michail Nikolaev
Date:
Hello, Peter.

>> * Add code to _bt_killitems() that detects if it has generated an FPI,
>> just to set some LP_DEAD bits.
>> * Instead of avoiding the FPI when this happens, proactively call
>> _bt_simpledel_pass() just before _bt_killitems() returns. Accept the
>> immediate cost of setting an LP_DEAD bit, just like today, but avoid
>> repeated FPIs.

> Yes, I am almost sure proactively calling of _bt_simpledel_pass() will
> positively impact the system on many workloads. But also I am almost
> sure it will not change the behavior of the incident I mention -
> because it is not related to multiple checkpoints.

I just realized that this seems to be a dangerous approach because of the locking mechanism. Currently _bt_killitems() requires only a read lock, but _bt_simpledel_pass() requires a write lock (it ends with _bt_delitems_delete()). Calling _bt_simpledel_pass() would require escalating the lock mode. Such a change may negatively affect many workloads because of the write lock taken during scanning - and it is really hard to prove the absence of regression (I have no idea how).

Thanks,
Michail.