Re: Eager page freeze criteria clarification - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Eager page freeze criteria clarification
Msg-id: CA+Tgmoa=0XhJ=Eo61m8vor4wUki2hhJoCT-syukidHEvsa+DmQ@mail.gmail.com
In response to: Re: Eager page freeze criteria clarification (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers

On Fri, Sep 29, 2023 at 11:57 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Assuming your concern is more or less limited to those cases where the
> same page could be frozen an unbounded number of times (once or almost
> once per VACUUM), then I think we fully agree. We ought to converge on
> the right behavior over time, but it's particularly important that we
> never converge on the wrong behavior instead.

I think that more or less matches my current thinking on the subject. A caveat might be: If it were once per two vacuums rather than once per vacuum, that might still be an issue. But I agree with the idea that the case that matters is *repeated* wasteful freezing. I don't think freezing is expensive enough that individual instances of mistaken freezing are worth getting too stressed about, but as you say, the overall pattern does matter.

> The TPC-C scenario is partly interesting because it isn't actually
> obvious what the most desirable behavior is, even assuming that you
> had perfect information, and were not subject to practical
> considerations about the complexity of your algorithm. There doesn't
> seem to be perfect clarity on what the goal should actually be in such
> scenarios -- it's not like the problem is just that we can't agree on
> the best way to accomplish those goals with this specific workload.
>
> If performance/efficiency and performance stability are directly in
> tension (as they sometimes are), how much do you want to prioritize
> one or the other? It's not an easy question to answer. It's a value
> judgement as much as anything else.

I think that's true. For me, the issue is what a user is practically likely to notice and care about. I submit that on a not-particularly-busy system, it would probably be fine to freeze aggressively in almost every situation, because you're only incurring costs you can afford to pay. On a busy system, it's more important to be right, or at least not too badly wrong. But even on a busy system, I think that when the time between data being written and being frozen is more than a few tens of minutes, it's very doubtful that anyone is going to notice the contribution that freezing makes to the overall workload. They're much more likely to notice an annoying autovacuum than they are to notice a bit of excess freezing that ends up getting reversed. But when you start cranking the time between writing data and freezing it down into the single-digit numbers of minutes, and even more if you push down to tens of seconds or less, now I think people are going to care more about useless freezing work than about long-term autovacuum risks. Because now their database is really busy, so they care a lot about performance, and seemingly most of the data involved is ephemeral anyway.

> Even if you're willing to assume that vacuum_freeze_min_age isn't just
> an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age
> is applied by VACUUM, at the point that it scans pages. If VACUUM were
> infinitely fast, and new VACUUMs were launched constantly, then
> vacuum_freeze_min_age (and this bucketing scheme) might make more
> sense. But, you know, they're not. So whether or not VACUUM (with
> Andres' algorithm) deems a page that it has frozen to have been
> opportunistically frozen or not is greatly influenced by factors that
> couldn't possibly be relevant.
I'm not totally sure that I'm understanding what you're concerned about here, but I *think* that the issue you're worried about is: if we have various rules that can cause freezing, let's say X, Y, and Z, and we adjust the aggressiveness of rule X based on the performance of rule Y, that would be stupid and might suck.

Assuming that the previous sentence is a correct framing, let's take X to be "freezing based on the page LSN age" and Y to be "freezing based on vacuum_freeze_min_age". I think the problem scenario here would be if it turns out that, under some set of circumstances, Y freezes more aggressively than X. For example, suppose the user runs VACUUM FREEZE, effectively setting vacuum_freeze_min_age=0 for that operation. If the table is being modified at all, it's likely to suffer a bunch of unfreezing right afterward, which could cause us to decide to make future vacuums freeze less aggressively. That's not necessarily what we want, because evidently the user, at least at that moment in time, thought that previous freezing hadn't been aggressive enough. They might be surprised to find that flash-freezing the table inhibited future automatic freezing. Or suppose that they just have a very high XID consumption rate compared to the rate of modifications to this particular table, such that criteria related to vacuum_freeze_min_age tend to be satisfied a lot, and thus vacuums tend to freeze a lot no matter what the page LSN age is.

This scenario actually doesn't seem like a problem, though. In this case the freezing criterion based on page LSN age is already not getting used, so it doesn't really matter whether we tune it up or down or whatever. The earlier scenario, where the user ran VACUUM FREEZE, is weirder, but it doesn't sound that horrible, either.

I did stop to wonder if we should just remove vacuum_freeze_min_age entirely, but I don't really see how to make that work. If we just always froze everything, then I guess we wouldn't need that value, because we would have effectively hard-coded it to zero. But if not, we need some kind of backstop to make sure that XID age eventually triggers freezing even if nothing else does, and vacuum_freeze_min_age is that thing. So I agree there could maybe be some kind of problem in this area, but I'm not quite seeing it.

> Okay then. I guess it's more accurate to say that we'll have a strong
> bias in the direction of freezing when an FPI won't result, though not
> an infinitely strong bias. We'll at least have something that can be
> thought of as an improved version of the FPI thing for 17, I think --
> which is definitely significant progress.

I do kind of wonder whether we're going to care about the FPI thing in the end. I don't mind if we do. But I wonder if it will prove necessary, or even desirable. Andres's algorithm requires a greater LSN age to trigger freezing when an FPI is required than when one isn't. But Melanie's test results seem to me to show that using a small LSN distance freezes too much on pgbench_accounts-type workloads and using a large one freezes too little on insert-only workloads. So I'm currently feeling a lot of skepticism about how useful it is to vary the LSN-distance threshold as a way of controlling the behavior. Maybe that intuition is misplaced, or maybe it will turn out that we can use the FPI criterion in some more satisfying way than using it to frob the LSN distance.
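To make sure we're talking about the same shape of thing, here's a rough sketch of how I'm picturing those criteria fitting together. This is purely illustrative -- it is not the code from Andres's or Melanie's patches, and every name in it (should_freeze_page, opportunistic_cutoff, fpi_multiplier, the stand-in types) is made up for the example:

/*
 * Illustrative sketch only -- not patch code.  Types are simplified
 * stand-ins for XLogRecPtr and TransactionId (wraparound is ignored).
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t SimpleLsn;   /* stand-in for XLogRecPtr */
typedef uint32_t SimpleXid;   /* stand-in for TransactionId */

/*
 * Would this VACUUM freeze this page?
 *
 * opportunistic_cutoff is the "page hasn't been dirtied for at least this
 * much LSN distance" threshold; fpi_multiplier demands a larger distance
 * when freezing would have to emit a full-page image; vacuum_freeze_min_age
 * remains the XID-age backstop that forces freezing eventually.
 */
bool
should_freeze_page(SimpleLsn insert_lsn, SimpleLsn page_lsn,
                   bool freeze_requires_fpi,
                   SimpleXid oldest_unfrozen_xid, SimpleXid next_xid,
                   uint64_t opportunistic_cutoff, unsigned fpi_multiplier,
                   uint32_t vacuum_freeze_min_age)
{
    uint64_t lsn_age = insert_lsn - page_lsn;
    uint64_t cutoff = opportunistic_cutoff;

    /* Backstop: sufficient XID age forces freezing regardless of LSN age. */
    if (next_xid - oldest_unfrozen_xid >= vacuum_freeze_min_age)
        return true;

    /* Be more reluctant when freezing would cost us an extra FPI. */
    if (freeze_requires_fpi)
        cutoff *= fpi_multiplier;

    /* Opportunistic criterion: the page has been "cold" for long enough. */
    return lsn_age >= cutoff;
}

int
main(void)
{
    SimpleLsn insert_lsn = 10ULL * 1024 * 1024 * 1024;       /* arbitrary current insert LSN */
    SimpleLsn page_lsn = insert_lsn - 512ULL * 1024 * 1024;  /* page last dirtied 512MB of WAL ago */

    bool freeze = should_freeze_page(insert_lsn, page_lsn,
                                     false,                /* no extra FPI needed */
                                     1000, 5000,           /* oldest unfrozen XID, next XID */
                                     256ULL * 1024 * 1024, /* 256MB opportunistic cutoff */
                                     8,                    /* demand 8x the distance if an FPI is needed */
                                     50000000);            /* vacuum_freeze_min_age default */

    printf("freeze this page? %s\n", freeze ? "yes" : "no");  /* prints "yes" */
    return 0;
}

In that shape, the FPI consideration is nothing more than a multiplier on the opportunistic cutoff, which is exactly the part I'm skeptical about; and the X-vs-Y concern above amounts to asking whether pages frozen via the backstop (or via a manual VACUUM FREEZE) should be allowed to influence opportunistic_cutoff at all.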
But if the algorithm does an overall good job guessing whether pages are likely to be modified again soon, then why care about whether an FPI is required? And if it doesn't, is caring about FPIs good enough to save us?

--
Robert Haas
EDB: http://www.enterprisedb.com