From: Robert Haas
Subject: Re: Eager page freeze criteria clarification
Msg-id: CA+TgmoYcWisxLBL-pXu13OevThLOXm20oJqjNRZtKkhXsY92XA@mail.gmail.com
In response to: Re: Eager page freeze criteria clarification (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
On Wed, Sep 27, 2023 at 7:09 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> At the risk of seeming too execution-focused, I want to try and get more
> specific. Here is a description of an example implementation to test my
> understanding:
>
> In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> storing the number of instances of unfreezing a page which is either
> younger or older than the start of the most recent checkpoint at the
> time of its unfreezing

When I first read this, I thought it sounded right, except for the
naming, which doesn't mention freezing. But on further thought, this
feels like it might not be the right thing. Why do I care how many
super-old pages (regardless of how exactly we define super-old) I
unfroze? I think what I care about is more like: how often do I have to
unfreeze a recently-frozen page in order to complete a DML operation?

For example, consider a table with a million pages, 10 of which are
frozen: 1 was frozen a long time ago, 9 were frozen recently, and the
other 999,990 are unfrozen. I update every page in the table. Well, now
older_than_cpt = 1 and younger_than_cpt = 9, so the ratio of the two is
terrible: 90% of the frozen pages I encountered were recently-frozen.
But that's irrelevant. What matters is that if I had frozen less
aggressively, I could have saved myself 9 unfreeze operations across a
million page modifications. So actually I'm doing great: only 0.0009%
of my page modifications required an unfreeze that I maybe should have
avoided and didn't. Likewise, if all of the pages had been frozen, but
only 9 were frozen recently, I'd be doing exactly as well. But if the
number of recently-frozen pages were higher, I'd be doing less well.

So I think comparing the number of young pages unfrozen to the number
of old pages unfrozen is the wrong test, and comparing it instead to
the number of pages modified is a better one. But I don't think that's
completely right, either. Imagine a table with 100 recently-frozen
pages. I modify the table and unfreeze one of them. So 100% of the
pages that I unfroze were recently-frozen. Does this mean that we
should back off and freeze less aggressively? Not really: it seems like
I actually got this case 99% right.

So another idea could be to look at what percentage of frozen pages I
end up un-freezing shortly thereafter. That gets this example right,
but it gets the earlier example with a million pages wrong.

Maybe there's something to combining these ideas. If the percentage of
page modifications (of clean pages, say, so that we don't skew the
stats as much when a page is modified many times in a row) that have to
un-freeze a recently-frozen page is low, then there's no real
performance penalty for whatever aggressive freezing we're doing. Even
if we un-freeze a zillion pages, if that happens over the course of
modifying 100 zillion pages, it's not material. On the flip side, if
the percentage of frozen pages that get unfrozen soon thereafter is
low, then the freezing feels worth doing even if most page
modifications end up un-freezing a recently-frozen page, because on net
we're still ending up with a lot of extra stuff frozen.

Said differently, we're freezing too aggressively if un-freezing is
both adding a significant expense (i.e. the foreground work we're doing
often requires un-freezing pages) and ineffective (i.e. we don't end up
with frozen pages left over). If it has only one of those problems,
it's still OK, but if it has both, we need to back off. A rough sketch
of how those two tests might combine follows below.
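To make that concrete, here is a minimal sketch in C; every name,
counter, and threshold in it is invented for illustration, not taken
from any existing patch:

    #include <stdbool.h>
    #include <stdint.h>

    #define EXPENSE_THRESHOLD 0.05   /* invented value */
    #define WASTE_THRESHOLD   0.50   /* invented value */

    /*
     * Back off eager freezing only when it is both expensive and
     * ineffective; either problem alone is tolerable.
     */
    static bool
    freezing_too_aggressive(uint64_t young_unfreezes,     /* recently-frozen pages unfrozen */
                            uint64_t clean_pages_modified,
                            uint64_t pages_frozen)
    {
        /* Expense: share of clean-page modifications that had to
         * un-freeze a recently-frozen page. */
        double  expense = clean_pages_modified > 0 ?
            (double) young_unfreezes / clean_pages_modified : 0.0;

        /* Ineffectiveness: share of frozen pages that were unfrozen
         * again soon after being frozen. */
        double  waste = pages_frozen > 0 ?
            (double) young_unfreezes / pages_frozen : 0.0;

        return expense > EXPENSE_THRESHOLD && waste > WASTE_THRESHOLD;
    }

In the million-page example above, expense is 9/1,000,000, so we keep
freezing no matter how bad waste looks; in the 100-page example, waste
is 1/100, so we likewise keep freezing.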
Maybe that's still not quite right, but it's the best I've got this
morning.

> You would likely want to combine it with one of the other heuristics we
> discussed.
>
> For example:
> For a table with only 20% younger unfreezings, when vacuuming that page,
>
> if insert LSN - RedoRecPtr < insert LSN - page LSN
>     page is older than the most recent checkpoint start, so freeze it
>     regardless of whether or not it would emit an FPI
>
> What aggressiveness levels should there be? What should change at each
> level? What criteria should pages have to meet to be subject to the
> aggressiveness level?

My guess is that, when we're being aggressive, we should be aiming to
freeze all pages that can be completely frozen, with no additional
criterion. Any additional criterion that we add here is likely to mess
up the insert-only table case, and I'm not really sure what it saves us
from.

On the other hand, when we're being non-aggressive, we might still
sometimes want to freeze in situations where we historically haven't.
Consider the following three examples:

1. pgbench_accounts table, standard run
2. pgbench_accounts table, highly skewed run
3. slowly growing table with a hot tail: records are inserted, updated
   heavily for a while, and thereafter updated only very occasionally

Aggressive freezing could misfire in all of these cases, because all of
them have some portion of the table where rows are being updated very
frequently. But in the second and third cases, there's also a cold
portion of the table where aggressive freezing is still OK.

Unless we could figure out how to get stats with page-level
granularity, which sounds infeasible, I think cases (2) and (3) will be
hard to get completely right. However, if we opportunistically froze
any page that hadn't been modified in the last checkpoint cycle, or any
page that hadn't been modified in the last 2 checkpoint cycles, or any
page we had to read into shared_buffers (following an earlier
suggestion from Andres), or any page that hadn't been modified in the
last 5 minutes, it seems like we could get a bit of extra freezing that
we're not getting today without too many downsides. We probably need
whatever criterion we use here to be pretty lenient to avoid freezing
the hot pages, which probably means we'll sometimes fail to freeze
pages that are actually cold but we don't quite realize it. But that
could still be better than today.
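That "not modified in the last checkpoint cycle" test is cheap to
express in terms of LSNs, and it is the same comparison as in the
heuristic quoted above: since the insert LSN appears on both sides,
"insert LSN - RedoRecPtr < insert LSN - page LSN" reduces to
"page LSN < RedoRecPtr". A minimal sketch, assuming we're in vacuum
with the heap page at hand; PageGetLSN() and GetRedoRecPtr() are
existing backend calls, while eagerly_freeze_page() is a made-up
placeholder:

    /* Assumes the usual backend includes, e.g. "access/xlog.h" and
     * "storage/bufpage.h". */
    XLogRecPtr  page_lsn = PageGetLSN(page);  /* LSN of last page change */
    XLogRecPtr  redo_ptr = GetRedoRecPtr();   /* redo start of latest checkpoint */

    if (page_lsn < redo_ptr)
    {
        /*
         * The page hasn't been modified since the most recent
         * checkpoint started, so freeze it regardless of whether
         * doing so emits an FPI. eagerly_freeze_page() is
         * hypothetical, not a real function.
         */
        eagerly_freeze_page(page);
    }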
--
Robert Haas
EDB: http://www.enterprisedb.com