Thread: autovacuum not prioritising for-wraparound tables
Hi, I have a bug pending that autovacuum fails to give priority to for-wraparound tables. When xid consumption rate is high and dead tuple creation is also high, it is possible that some tables are waiting for for-wraparound vacuums that don't complete in time because the workers are busy processing other tables that have accumulated dead tuples; the system is then down because it's too near the Xid wraparound horizon. Apparently this is particularly notorious in connection with TOAST tables, because those are always put in the tables-to-process list after regular tables. (As far as I recall, this was already reported elsewhere, but so far I have been unable to find the discussion in the archives. Pointers appreciated.) So here's a small, backpatchable patch that sorts the list of tables to process (not all that much tested yet). Tables which have the wraparound flag set are processed before those that are not. Other than this criterion, the order is not defined. Now we could implement this differently, and maybe more simply (say by keeping two lists of tables to process, one with for-wraparound tables and one with the rest) but this way it is simpler to add additional sorting criteria later: say within each category we could first process smaller tables that have more dead tuples. My intention is to clean this up and backpatch to all live branches. Comments? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
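The patch itself isn't reproduced here; as a minimal sketch of the rule it describes (wraparound-flagged tables first, order within each group left to qsort()), using an invented stand-in struct rather than the actual autovacuum.c data structures:

#include <stdlib.h>
#include <stdbool.h>

/* Simplified, illustrative stand-in for autovacuum's per-table work item. */
typedef struct av_table
{
    unsigned int relid;
    bool         for_wraparound;    /* flagged for anti-wraparound vacuum? */
} av_table;

/* Wraparound tables sort first; within each group the order stays unspecified. */
static int
av_table_cmp(const void *a, const void *b)
{
    const av_table *ta = (const av_table *) a;
    const av_table *tb = (const av_table *) b;

    if (ta->for_wraparound == tb->for_wraparound)
        return 0;
    return ta->for_wraparound ? -1 : 1;
}

/* usage: qsort(tables, ntables, sizeof(av_table), av_table_cmp); */

Leaving the within-group order entirely to qsort() is the point picked up a few messages below.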
On 24.01.2013 23:57, Alvaro Herrera wrote: > I have a bug pending that autovacuum fails to give priority to > for-wraparound tables. When xid consumption rate is high and dead tuple > creation is also high, it is possible that some tables are waiting for > for-wraparound vacuums that don't complete in time because the workers > are busy processing other tables that have accumulated dead tuples; the > system is then down because it's too near the Xid wraparound horizon. > Apparently this is particularly notorious in connection with TOAST > tables, because those are always put in the tables-to-process list after > regular tables. > > (As far as I recall, this was already reported elsewhere, but so far I > have been unable to find the discussion in the archives. Pointers > appreciated.) > > So here's a small, backpatchable patch that sorts the list of tables to > process (not all that much tested yet). Tables which have the > wraparound flag set are processed before those that are not. Other > than this criterion, the order is not defined. > > Now we could implement this differently, and maybe more simply (say by > keeping two lists of tables to process, one with for-wraparound tables > and one with the rest) but this way it is simpler to add additional > sorting criteria later: say within each category we could first process > smaller tables that have more dead tuples. > > My intention is to clean this up and backpatch to all live branches. > Comments? Backpatching sounds a bit scary. It's not a clear-cut bug, it's just that autovacuum could be smarter about its priorities. There are other ways you can still bump into the xid-wraparound issue, even with this patch. - Heikki
On Thu, Jan 24, 2013 at 5:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Backpatching sounds a bit scary. It's not a clear-cut bug, it's just that > autovacuum could be smarter about its priorities. There are other ways you > can still bump into the xid-wraparound issue, even with this patch. I don't think this is a single-priority issue. It's *also* crucial that small tables with high "tuple attrition rates" get vacuumed extremely frequently; your system will bog down, albeit in a different way, if the small tables don't get vacuumed enough. This seems to me to involve multiple competing priorities where the main solution *I* can think of is to have multiple backends doing autovacuum, and assigning some to XID activity and others to the "small, needs vacuuming frequently" tables. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
Christopher Browne <cbbrowne@gmail.com> writes: > On Thu, Jan 24, 2013 at 5:22 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> Backpatching sounds a bit scary. It's not a clear-cut bug, it's just that >> autovacuum could be smarter about its priorities. There are other ways you >> can still bump into the xid-wraparound issue, even with this patch. > I don't think this is a single-priority issue. It's *also* crucial > that small tables with high "tuple attrition rates" get vacuumed > extremely frequently; your system will bog down, albeit in a different > way, if the small tables don't get vacuumed enough. Yeah. Another problem with a simple-minded priority arrangement is that it might cause some tables to get starved for service because workers keep on choosing other ones; we have to be sure the sorting rule is designed to prevent that. As posted, what we've got here is sorting on a boolean condition, with the behavior within each group totally up to the whims of qsort(). That seems especially dangerous since the priority order is mostly undefined. I was a bit surprised that Alvaro didn't propose sorting by the age of relfrozenxid, at least for the subset of tables that are considered wraparound hazards. Not sure what a good criterion is for the rest. regards, tom lane
Hi Alvaro, Nice to see a patch on this! On 2013-01-24 18:57:15 -0300, Alvaro Herrera wrote: > I have a bug pending that autovacuum fails to give priority to > for-wraparound tables. When xid consumption rate is high and dead tuple > creation is also high, it is possible that some tables are waiting for > for-wraparound vacuums that don't complete in time because the workers > are busy processing other tables that have accumulated dead tuples; the > system is then down because it's too near the Xid wraparound horizon. > Apparently this is particularly notorious in connection with TOAST > tables, because those are always put in the tables-to-process list after > regular tables. > > (As far as I recall, this was already reported elsewhere, but so far I > have been unable to find the discussion in the archives. Pointers > appreciated.) > > So here's a small, backpatchable patch that sorts the list of tables to > process (not all that much tested yet). Tables which have the > wraparound flag set are processed before those that are not. Other > than this criterion, the order is not defined. > > Now we could implement this differently, and maybe more simply (say by > keeping two lists of tables to process, one with for-wraparound tables > and one with the rest) but this way it is simpler to add additional > sorting criteria later: say within each category we could first process > smaller tables that have more dead tuples. If I remember the issue that triggered this correctly, I don't think this would be sufficient to solve the whole issue, although it sure would delay the shutdown. Due to the high activity on the system while some bigger, active table got vacuumed, other previously vacuumed tables had already hit freeze_max_age again and thus were eligible for vacuum again, even though other tables - in our specific case always toast relations, because they always got added last - were very close to the shutdown limit. So I think we need to sort by age(relfrozenxid) in tables that are over the anti-wraparound limit. Given your code that doesn't seem to be that hard? I think after the infrastructure is there we might want to have some more intelligence for non-wraparound tables too, but that possibly looks more like a HEAD than a backpatch thing. I am very much of the opinion that this needs to be backpatched though - it's a pretty bad thing if autovacuum cannot be relied on to keep a system from shutting itself down because it always vacuums the wrong relations and never gets to the problematic ones. Single user mode is nothing normal users should ever have to see. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane escribió: > As posted, what we've got here is sorting on a boolean condition, with > the behavior within each group totally up to the whims of qsort(). That > seems especially dangerous since the priority order is mostly undefined. > > I was a bit surprised that Alvaro didn't propose sorting by the age of > relfrozenxid, at least for the subset of tables that are considered > wraparound hazards. Not sure what a good criterion is for the rest. Hah. This patch began life with more complex prioritisation, but before going much further I dumbed down the idea to avoid having to discuss these issues, as this doesn't seem like particularly good timing. And I do want to get something back-patchable. So if we're to discuss this, here's what I had in mind:

1. For-wraparound tables always go first; those with the oldest age(relfrozenxid) sort earlier. For tables of the same age, consider size as below.

2. For other tables, consider floor(log(size)). This makes tables of sizes in the same ballpark be considered together.

3. For tables of similar size, consider (n_dead_tuples - threshold) / threshold. "threshold" is what gets calculated as the number of tuples over which a table is considered for vacuuming. This number, then, is a relative measure of how badly vacuuming is needed.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
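A sketch of how those three criteria could stack in a single comparator; the struct and field names are hypothetical, and it arbitrarily puts smaller size classes first since the proposal above leaves that direction open:

#include <math.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct av_candidate
{
    bool    for_wraparound;     /* past autovacuum_freeze_max_age? */
    int64_t frozenxid_age;      /* age(relfrozenxid) */
    int64_t relpages;           /* table size in pages, assumed > 0 */
    double  n_dead_tuples;
    double  av_threshold;       /* computed dead-tuple vacuum threshold */
} av_candidate;

static int
av_priority_cmp(const void *pa, const void *pb)
{
    const av_candidate *a = (const av_candidate *) pa;
    const av_candidate *b = (const av_candidate *) pb;

    /* 1. wraparound tables first, oldest relfrozenxid earliest */
    if (a->for_wraparound != b->for_wraparound)
        return a->for_wraparound ? -1 : 1;
    if (a->for_wraparound && a->frozenxid_age != b->frozenxid_age)
        return (a->frozenxid_age > b->frozenxid_age) ? -1 : 1;

    /* 2. group by rough size class: floor(log(relpages)) */
    int class_a = (int) floor(log((double) a->relpages));
    int class_b = (int) floor(log((double) b->relpages));
    if (class_a != class_b)
        return (class_a < class_b) ? -1 : 1;

    /* 3. within a size class: how far past the dead-tuple threshold */
    double urgency_a = (a->n_dead_tuples - a->av_threshold) / a->av_threshold;
    double urgency_b = (b->n_dead_tuples - b->av_threshold) / b->av_threshold;
    if (urgency_a != urgency_b)
        return (urgency_a > urgency_b) ? -1 : 1;
    return 0;
}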
On 1/25/13 10:29 AM, Alvaro Herrera wrote: > And I do want to get something back-patchable. Autovacuum has existed for N years and nobody complained about this until just now, so I don't see a strong justification for backpatching. Or is this a regression from an earlier release? In general, I think we should backpatch less.
Peter Eisentraut escribió: > On 1/25/13 10:29 AM, Alvaro Herrera wrote: > > And I do want to get something back-patchable. > > Autovacuum has existed for N years and nobody complained about this > until just now, so I don't see a strong justification for backpatching. I disagree about people not complaining. Maybe the complaints have not been specifically about the wraparound stuff and toast tables, but for sure there have been complaints about autovacuum not giving more priority to tables that need work more urgently. > Or is this a regression from an earlier release? Nope. > In general, I think we should backpatch less. I don't disagree with this general principle, but I certainly don't like the idea of letting systems run with known flaws just because we're too scared to patch them. Now I don't object to a plan such as keep it in master only for a while and backpatch after it has seen some more testing. But for large sites, this is a real problem and they have to work around it manually which is frequently inconvenient; keep in mind 9.0 is going to be supported for years yet. That said, if consensus here is to not backpatch this at all, I will go with that; but let's have the argument first. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Peter Eisentraut escribió: >> Autovacuum has existed for N years and nobody complained about this >> until just now, so I don't see a strong justification for backpatching. > I disagree about people not complaining. Maybe the complaints have not > been specifically about the wraparound stuff and toast tables, but for > sure there have been complaints about autovacuum not giving more > priority to tables that need work more urgently. FWIW, I don't see that this is too scary to back-patch. It's unlikely to make things worse than the current coding, which is more or less pg_class tuple order. I do suggest that it might be wise not to try to squeeze it into the early-February update releases. Put it in master as soon as we agree on the behavior, then back-patch after the next updates. That will give us a couple months' testing, rather than a few days, before it hits any release tarballs. regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > So if we're to discuss this, here's what I had in mind: > 1. for-wraparound tables always go first; oldest age(relfrozenxid) are > sorted earlier. For tables of the same age, consider size as below. It seems unlikely that age(relfrozenxid) will be identical for multiple tables often enough to worry about, so the second part of that seems like overcomplication. > 2. for other tables, consider floor(log(size)). This makes tables of > sizes in the same ballpark be considered together. > 3. For tables of similar size, consider > (n_dead_tuples - threshold) / threshold. > "threshold" is what gets calculated as the number of tuples over which > a table is considered for vacuuming. This number, then, is a relative > measure of how hard is vacuuming needed. The floor(log(size)) part seems like it will have rather arbitrary behavioral shifts when a table grows just past a log boundary. Also, I'm not exactly sure whether you're proposing smaller tables first or bigger tables first, nor that either of those orderings is a good thing. I think sorting by just age(relfrozenxid) for for-wraparound tables, and just the n_dead_tuples measurement for others, is probably reasonable for now. If we find out that has bad behaviors then we can look at how to fix them, but I don't think we have enough understanding yet of what the bad behaviors might be. regards, tom lane
On 2013-01-25 11:51:33 -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > 2. for other tables, consider floor(log(size)). This makes tables of > > sizes in the same ballpark be considered together. > > > 3. For tables of similar size, consider > > (n_dead_tuples - threshold) / threshold. > > "threshold" is what gets calculated as the number of tuples over which > > a table is considered for vacuuming. This number, then, is a relative > > measure of how hard is vacuuming needed. > > The floor(log(size)) part seems like it will have rather arbitrary > behavioral shifts when a table grows just past a log boundary. Also, > I'm not exactly sure whether you're proposing smaller tables first or > bigger tables first, nor that either of those orderings is a good thing. That seems dubious to me as well. > I think sorting by just age(relfrozenxid) for for-wraparound tables, and > just the n_dead_tuples measurement for others, is probably reasonable > for now. If we find out that has bad behaviors then we can look at how > to fix them, but I don't think we have enough understanding yet of what > the bad behaviors might be. If we want another ordering criterion than that it might be worth thinking about something like n_dead_tuples/relpages to make sure that small tables with a high dead tuples ratio get vacuumed in time. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 25, 2013 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The floor(log(size)) part seems like it will have rather arbitrary > behavioral shifts when a table grows just past a log boundary. Also, > I'm not exactly sure whether you're proposing smaller tables first or > bigger tables first, nor that either of those orderings is a good thing. > > I think sorting by just age(relfrozenxid) for for-wraparound tables, and > just the n_dead_tuples measurement for others, is probably reasonable > for now. If we find out that has bad behaviors then we can look at how > to fix them, but I don't think we have enough understanding yet of what > the bad behaviors might be. Which is exactly why back-patching this is not a good idea, IMHO. We could easily run across a system where pg_class order happens to be better than anything else we come up with. Such changes are expected in new major versions, but not in maintenance releases. I think that to do this right, we need to consider not only the status quo but the trajectory. For example, suppose we have two tables to process, one of which needs a wraparound vacuum and the other one of which needs dead tuples removed. If the table needing the wraparound vacuum is small and just barely over the threshold, it isn't urgent; but if it's large and way over the threshold, it's quite urgent. Similarly, if the table which needs dead tuples removed is rarely updated, postponing vacuum is not a big deal, but if it's being updated like crazy, postponing vacuum is a big problem. Categorically putting autovacuum wraparound tables ahead of everything else seems simplistic, and thinking that more dead tuples is more urgent than fewer dead tuples seems *extremely* simplistic. I ran across a real-world case where a user had a small table that had to be vacuumed every 15 seconds to prevent bloat. If we change the algorithm in a way that gives other things priority over that table, then that user could easily get hosed when they install a maintenance release containing this change. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-01-25 12:19:25 -0500, Robert Haas wrote: > On Fri, Jan 25, 2013 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The floor(log(size)) part seems like it will have rather arbitrary > > behavioral shifts when a table grows just past a log boundary. Also, > > I'm not exactly sure whether you're proposing smaller tables first or > > bigger tables first, nor that either of those orderings is a good thing. > > > > I think sorting by just age(relfrozenxid) for for-wraparound tables, and > > just the n_dead_tuples measurement for others, is probably reasonable > > for now. If we find out that has bad behaviors then we can look at how > > to fix them, but I don't think we have enough understanding yet of what > > the bad behaviors might be. > I think that to do this right, we need to consider not only the status > quo but the trajectory. For example, suppose we have two tables to > process, one of which needs a wraparound vacuum and the other one of > which needs dead tuples removed. If the table needing the wraparound > vacuum is small and just barely over the threshold, it isn't urgent; > but if it's large and way over the threshold, it's quite urgent. > Similarly, if the table which needs dead tuples removed is rarely > updated, postponing vacuum is not a big deal, but if it's being > updated like crazy, postponing vacuum is a big problem. Categorically > putting autovacuum wraparound tables ahead of everything else seems > simplistic, and thinking that more dead tuples is more urgent than > fewer dead tuples seems *extremely* simplistic. I don't think the first part is problematic. Which scenario do you have in mind where that would really cause adverse behaviour? Autovacuum seldom does full-table vacuums otherwise these days, so tables get "old" in that sense pretty regularly and mostly uniformly. I agree that the second criterion isn't worth very much and that we need something better there. > I ran across a real-world case where a user had a small table that had > to be vacuumed every 15 seconds to prevent bloat. If we change the > algorithm in a way that gives other things priority over that table, > then that user could easily get hosed when they install a maintenance > release containing this change. I think if we backpatch this we should only prefer wraparound tables and leave the rest unchanged. > Which is exactly why back-patching this is not a good idea, IMHO. We > could easily run across a system where pg_class order happens to be > better than anything else we come up with. Such changes are expected > in new major versions, but not in maintenance releases. I think a minimal version might be acceptable. It's a bug if the database regularly shuts down and you need to write manual vacuuming scripts to prevent it from happening. I don't think the argument that the pg_class order might work better than anything holds that much truth - it's not like that's something really stable. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > I think if we backpatch this we should only prefer wraparound tables and > leave the rest unchanged. That's not a realistic option, at least not with anything that uses this approach to sorting the tables. You'd have to assume that qsort() is stable which it probably isn't. > I don't think the argument that the pg_class order might work better > than anything holds that much truth - its not like thats something > really stable. I find that less than credible as well. regards, tom lane
On Fri, Jan 25, 2013 at 12:00 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-01-25 11:51:33 -0500, Tom Lane wrote: >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> > 2. for other tables, consider floor(log(size)). This makes tables of >> > sizes in the same ballpark be considered together. >> >> > 3. For tables of similar size, consider >> > (n_dead_tuples - threshold) / threshold. >> > "threshold" is what gets calculated as the number of tuples over which >> > a table is considered for vacuuming. This number, then, is a relative >> > measure of how hard is vacuuming needed. >> >> The floor(log(size)) part seems like it will have rather arbitrary >> behavioral shifts when a table grows just past a log boundary. Also, >> I'm not exactly sure whether you're proposing smaller tables first or >> bigger tables first, nor that either of those orderings is a good thing. > > That seems dubious to me as well. > >> I think sorting by just age(relfrozenxid) for for-wraparound tables, and >> just the n_dead_tuples measurement for others, is probably reasonable >> for now. If we find out that has bad behaviors then we can look at how >> to fix them, but I don't think we have enough understanding yet of what >> the bad behaviors might be. > > If we want another ordering criterion than that it might be worth > thinking about something like n_dead_tuples/relpages to make sure that > small tables with a high dead tuples ratio get vacuumed in time.

I'd imagine it's a good idea to reserve some autovacuum connections for small tables, that is, to have a maximum relpages for some portion of the connections. That way you don't get stuck having all the connections busy working on huge tables and leaving small tables starved. That scenario seems pretty obvious.

I'd be inclined to do something a bit more sophisticated than just age(relfrozenxid) for wraparound; I'd be inclined to kick off large tables' wraparound vacuums earlier than those for smaller tables. With a little bit of noodling around, here's a thought for a joint function that I *think* has reasonably common scales:

f(deadtuples, relpages, age) = deadtuples/relpages + e ^ (age*ln(relpages)/2^32)

When the age of the table is low, this is dominated by the deadtuples/relpages part of the equation; you vacuum tables based on what has the largest % of dead tuples. But when a table is not vacuumed for a long time, the second term will kick in, and we'll tend to:
a) Vacuum the ones that are largest the earliest, but nonetheless
b) Vacuum them as the ratio of age/2^32 gets close to 1.

This function assumes relpages > 0, and there's a constant, 2^32, there which might be fiddled with. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
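For concreteness, that joint function can be written as a small helper; the second term is equivalently relpages^(age/2^32), so it climbs toward the table's size in pages as the age approaches the wraparound horizon (relpages > 0 assumed, and the 2^32 constant is the knob mentioned above):

#include <math.h>

/*
 * Joint priority as proposed above:
 *   f = deadtuples/relpages + e^(age * ln(relpages) / 2^32)
 * Higher values would be vacuumed first.
 */
static double
av_joint_priority(double deadtuples, double relpages, double age)
{
    return deadtuples / relpages +
           exp(age * log(relpages) / 4294967296.0);    /* 2^32 */
}

For a freshly vacuumed table the first term dominates; as age/2^32 nears 1 the second term approaches relpages, which is what pushes larger tables to the front earlier.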
On 2013-01-25 12:52:46 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > I think if we backpatch this we should only prefer wraparound tables and > > leave the rest unchanged. > > That's not a realistic option, at least not with anything that uses this > approach to sorting the tables. You'd have to assume that qsort() is > stable which it probably isn't. Well, comparing them equally will result in an order about as arbitrary as right now, so I don't really see a problem with that. I am fine with sorting them truly randomly as well (by assigning a temporary value when putting it into the list so the comparison is repeatable and conforms to the triangle inequality etc). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 25, 2013 at 12:35 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> I think that to do this right, we need to consider not only the status >> quo but the trajectory. For example, suppose we have two tables to >> process, one of which needs a wraparound vacuum and the other one of >> which needs dead tuples removed. If the table needing the wraparound >> vacuum is small and just barely over the threshold, it isn't urgent; >> but if it's large and way over the threshold, it's quite urgent. >> Similarly, if the table which needs dead tuples removed is rarely >> updated, postponing vacuum is not a big deal, but if it's being >> updated like crazy, postponing vacuum is a big problem. Categorically >> putting autovacuum wraparound tables ahead of everything else seems >> simplistic, and thinking that more dead tuples is more urgent than >> fewer dead tuples seems *extremely* simplistic. > > I don't think the first part is problematic. Which scenario do you have > in mind where that would really cause adverse behaviour? autovacuum > seldomly does full table vacuums on tables otherwise these days so > tables get "old" in that sense pretty regularly and mostly uniform. I'm worried about the case of a very, very frequently updated table getting put ahead of a table that needs a wraparound vacuum, but only just. It doesn't sit well with me to think that the priority of that goes from 0 (we don't even try to update it) to infinity (it goes ahead of all tables needing to be vacuumed for dead tuples) the instant we hit the vacuum_freeze_table_age. One idea would be to give each table a "badness". So estimate the percentage of the tuples in each table that are dead. And then we compute the percentage by which age(relfrozenxid) exceeds the table age, and add those two percentages up to get total badness. We process tables that are otherwise-eligible for vacuuming in descending order of badness. So if autovacuum_vacuum_scale_factor = 0.2 and a table is more than 120% of vacuum_freeze_table_age, then it's certain to be vacuumed before any table that only needs dead-tuple processing. But if it's only slightly past the cutoff, it doesn't get to stomp all over the people who need dead tuples cleaned up. The thing is, avoiding a full-cluster shutdown due to anti-wraparound vacuum is important. But, IME, that rarely happens. What is much more common is that an individual table gets bloated and CLUSTER or VACUUM FULL is required to recover, and now the system is effectively down for as long as that takes to complete. I don't want to make that case substantially more likely just to avoid a danger of full-cluster shutdown that, for most users most of the time, is really a very remote risk. There's some point at which an anti-wraparound vacuum should not only trump everything else, but probably also ignore the configured cost delay settings - but equating that point with the first point at which we consider doing it at all does not seem right to me. > I think a minimal version might be acceptable. Its a bug if the database > regularly shuts down and you need to write manual vacuuming scripts to > prevent it from happening. > > I don't think the argument that the pg_class order might work better > than anything holds that much truth - its not like thats something > really stable. I freely admit that if pg_class order happens to work better, it's just good luck. But sometimes people get lucky. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
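One way to read that "badness" idea - the message leaves the exact scaling of the two percentages open, so both the scaling and the field names below are assumptions:

/*
 * Hypothetical "badness" combining bloat and wraparound urgency:
 *   dead_pct: share of the table's tuples that are dead, 0..100
 *   age_pct:  percentage by which age(relfrozenxid) exceeds
 *             vacuum_freeze_table_age (0 until the table crosses it)
 */
static double
table_badness(double n_dead_tuples, double reltuples,
              double frozenxid_age, double freeze_table_age)
{
    double dead_pct = (reltuples > 0.0) ? 100.0 * n_dead_tuples / reltuples : 0.0;
    double age_pct = (frozenxid_age > freeze_table_age)
        ? 100.0 * (frozenxid_age - freeze_table_age) / freeze_table_age
        : 0.0;

    return dead_pct + age_pct;
}

Otherwise-eligible tables would then be processed in descending badness; with this particular scaling a table only slightly past vacuum_freeze_table_age competes on roughly even terms with moderately bloated tables, while a severely overdue one eventually outscores any possible dead-tuple percentage.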
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jan 25, 2013 at 12:35 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> I don't think the first part is problematic. Which scenario do you have >> in mind where that would really cause adverse behaviour? autovacuum >> seldomly does full table vacuums on tables otherwise these days so >> tables get "old" in that sense pretty regularly and mostly uniform. > I'm worried about the case of a very, very frequently updated table > getting put ahead of a table that needs a wraparound vacuum, but only > just. It doesn't sit well with me to think that the priority of that > goes from 0 (we don't even try to update it) to infinity (it goes > ahead of all tables needing to be vacuumed for dead tuples) the > instant we hit the vacuum_freeze_table_age. Well, really the answer to that is that we have multiple autovac workers, and even if the first one that comes along picks the wraparound job, the next one won't. Having said that, I agree that it might be better to express the sort priority as some sort of continuous function of multiple figures of merit, rather than "sort by one then the next". See Chris Browne's mail for another variant. regards, tom lane
On Fri, Jan 25, 2013 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Jan 25, 2013 at 12:35 PM, Andres Freund <andres@2ndquadrant.com> wrote: >>> I don't think the first part is problematic. Which scenario do you have >>> in mind where that would really cause adverse behaviour? autovacuum >>> seldomly does full table vacuums on tables otherwise these days so >>> tables get "old" in that sense pretty regularly and mostly uniform. > >> I'm worried about the case of a very, very frequently updated table >> getting put ahead of a table that needs a wraparound vacuum, but only >> just. It doesn't sit well with me to think that the priority of that >> goes from 0 (we don't even try to update it) to infinity (it goes >> ahead of all tables needing to be vacuumed for dead tuples) the >> instant we hit the vacuum_freeze_table_age. > > Well, really the answer to that is that we have multiple autovac > workers, and even if the first one that comes along picks the wraparound > job, the next one won't. Sure, but you could easily have 10 or 20 cross the vacuum_freeze_table_age threshold simultaneously - and you'll only be able to process a few of those at a time, due to autovacuum_max_workers. Moreover, even if you don't hit the autovacuum_max_workers limit (say it's jacked up to 100 or so), you're still introducing a delay of up to N * autovacuum_naptime, where N is the number of tables that cross the threshold at the same instant, before any dead-tuple cleanup vacuums are initiated. It's not difficult to imagine that being bad. > Having said that, I agree that it might be better to express the > sort priority as some sort of continuous function of multiple figures of > merit, rather than "sort by one then the next". See Chris Browne's > mail for another variant. Ah, so. I think, though, that my variant is a whole lot simpler and accomplishes mostly the same purpose. One difference between my proposal and the others that have popped up thus far is that I am not convinced table size matters, or at least not in the way that people are proposing to make it matter. The main reason I can see why big tables matter more than small tables is that a big table takes *longer* to autovacuum than a small table. If you are 123,456 transactions from a cluster-wide shutdown, and there is one big table and one small table that need to be autovacuumed, you had better start on the big one first - because the next autovacuum worker to come along will quite possibly be able to finish the small one before doomsday, but if you don't start the big one now you won't finish in time. This remains true even if the small table has a slightly older relfrozenxid than the large one, but ceases to be true when the difference is large enough that vacuuming the small one first will advance datfrozenxid enough to extend the time until a shutdown occurs by more than the time it takes to vacuum it. For dead-tuple vacuuming, the question of whether the table is large or small does not seem to me to have a categorical right answer. You could argue that it's more important to recover 2GB of space in a 20GB table than 2MB of space in a 20MB table, because more space is being wasted. On the flip side you could argue that a small table becomes bloated much more easily than a large table, because even a minute of heavy update activity can turn over the entire table contents, which is unlikely for a larger table.
I am inclined to think that the percentage of dead tuples is a more important rubric - if things are going well, it shouldn't ever be much different from the threshold that triggers AV in the first place - but if somehow it is much different (e.g. because the table's been locked for a while, or is accumulating more bloat than the threshold in a single autovacuum_naptime), that seems like good justification for doing it ahead of other things that are less bloated. We do need to make sure that the formula is defined in such a way that something that is *severely* past vacuum_freeze_table_age always beats an arbitrarily-bloated table. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 25 January 2013 17:19, Robert Haas <robertmhaas@gmail.com> wrote: > We > could easily run across a system where pg_class order happens to be > better than anything else we come up with. I think you should read that back to yourself and see if you still feel the word "easily" applies here. I agree with Tom that it's hard for almost any prioritisation not to be better than we have now. But also, we should keep it fairly simple to avoid introducing new behaviour that defeats people with a highly tuned vacuum config. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jan 27, 2013 at 4:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 25 January 2013 17:19, Robert Haas <robertmhaas@gmail.com> wrote: >> We >> could easily run across a system where pg_class order happens to be >> better than anything else we come up with. > > I think you should read that back to yourself and see if you still > feel the word "easily" applies here. I absolutely do. You will not convince me that whacking around the behavior of autovacuum in a maintenance release is a remotely sane thing to do. There are plenty of things wrong with the way autovacuum works today, and I am all in favor of fixing them - but not in the back-branches. Every time we whack behavior around in the back branches, no matter how innocuous it looks, somebody's environment gets broken, and then they won't apply patch releases, and it causes all sorts of headaches. At least, that's my experience at EnterpriseDB. YMMV. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 25, 2013 at 9:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > I think that to do this right, we need to consider not only the status > quo but the trajectory. For example, suppose we have two tables to > process, one of which needs a wraparound vacuum and the other one of > which needs dead tuples removed. If the table needing the wraparound > vacuum is small and just barely over the threshold, it isn't urgent; But it being small, it also won't take long to vacuum. Why not just do it? > but if it's large and way over the threshold, it's quite urgent. > Similarly, if the table which needs dead tuples removed is rarely > updated, postponing vacuum is not a big deal, but if it's being > updated like crazy, postponing vacuum is a big problem. I don't see this as being the case. If it is being updated like crazy, it doesn't matter whether it meets the threshold to have tuples removed *right at the moment* or not. It will meet that threshold soon. If you can't keep up with that need with your current settings, you have a steady-state problem. Changing the order, or not changing the order, isn't going to make a whole lot of difference; you need to overcome the steady-state problem. > Categorically > putting autovacuum wraparound tables ahead of everything else seems > simplistic, and thinking that more dead tuples is more urgent than > fewer dead tuples seems *extremely* simplistic. > > I ran across a real-world case where a user had a small table that had > to be vacuumed every 15 seconds to prevent bloat. If we change the > algorithm in a way that gives other things priority over that table, Eventually an anti-wraparound vacuum is going to be done, and once it starts it does have priority, because things already underway don't get preempted. Have they ever reached that point? Did it cause problems? > then that user could easily get hosed when they install a maintenance > release containing this change. Yeah, I don't know that back-patching is a good idea, or at least not soon. Cheers, Jeff
On Fri, Jan 25, 2013 at 10:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > I'm worried about the case of a very, very frequently updated table > getting put ahead of a table that needs a wraparound vacuum, but only > just. It doesn't sit well with me to think that the priority of that > goes from 0 (we don't even try to update it) to infinity (it goes > ahead of all tables needing to be vacuumed for dead tuples) the > instant we hit the vacuum_freeze_table_age. What if it were the instant we hit autovacuum_freeze_max_age, not vacuum_freeze_table_age? Or does the current behavior already do this? Which process is responsible for enforcing autovacuum_freeze_max_age? Cheers, Jeff
On Thu, Jan 24, 2013 at 1:57 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Hi, > > I have a bug pending that autovacuum fails to give priority to > for-wraparound tables. When xid consumption rate is high and dead tuple > creation is also high, it is possible that some tables are waiting for > for-wraparound vacuums that don't complete in time because the workers > are busy processing other tables that have accumulated dead tuples; the > system is then down because it's too near the Xid wraparound horizon. > Apparently this is particularly notorious in connection with TOAST > tables, because those are always put in the tables-to-process list after > regular tables. Is something killing off your autovacuum workers routinely, such that they rarely reach the end of their to-do list? Otherwise it seems like the tables would come up for vacuuming in a cyclic fashion, staggered for each worker; and it being a circle it shouldn't systematically matter where in it they were added. What are the various settings for vacuuming? Cheers, Jeff
On 27 January 2013 17:11, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Jan 27, 2013 at 4:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 25 January 2013 17:19, Robert Haas <robertmhaas@gmail.com> wrote: >>> We >>> could easily run across a system where pg_class order happens to be >>> better than anything else we come up with. >> >> I think you should read that back to yourself and see if you still >> feel the word "easily" applies here. > > I absolutely do. > You will not convince me that whacking around the > behavior of autovacuum in a maintenance release is a remotely sane > thing to do. This is a different argument. It would be better to say this than to come up with implausible problems as a way of rejecting something. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> So I think we need to sort by age(relfrozenxid) in tables that are over > the anti-wraparound limit. Given your code that doesn't seem to be that > hard? I might also suggest that we think about changing the defaults for wraparound vacuum behavior. Particularly, the fact that vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by default is optimal for absolutely nobody, and forces re-wraparound vacuuming of wraparound tables which were just recently wraparound-vacuumed. We should lower vacuum_freeze_min_age to something sane, like 1000000. (background: http://www.databasesoup.com/2012/10/freezing-your-tuples-off-part-2.html) Also, while I don't know if Alvaro's optimization is a net gain or not (it might be), I do agree that backpatching it is not worth considering. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-01-29 00:11:12 +1100, Josh Berkus wrote: > > > So I think we need to sort by age(relfrozenxid) in tables that are over > > the anti-wraparound limit. Given your code that doesn't seem to be that > > hard? > > I might also suggest that we think about changing the defaults for > wraparound vacuum behavior. Partcularly, the fact that > vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by default is > optimal for absolutely nobody, and forces re-wraparound vacuuming of > wraparound tables which were just recently wraparound-vacuumed. We > should lower vacuum_freeze_min_age to something sane, like 1000000. I have to admit, I fail to see why this is a good idea. There isn't much of an efficiency bonus in freezing early (due to hint bits) and vacuums over vacuum_freeze_table_age are considerably more expensive as they have to scan the whole heap instead of using the visibilitymap. And if you don't vacuum the whole heap you can't lower relfrozenxid. So changing freeze_min_age doesn't help at all to avoid anti-wraparound vacuums. Am I missing something? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-01-29 00:11:12 +1100, Josh Berkus wrote: >> >>> So I think we need to sort by age(relfrozenxid) in tables that >>> are over the anti-wraparound limit. Given your code that >>> doesn't seem to be that hard? >> >> I might also suggest that we think about changing the defaults >> for wraparound vacuum behavior. Partcularly, the fact that >> vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by >> default is optimal for absolutely nobody, and forces >> re-wraparound vacuuming of wraparound tables which were just >> recently wraparound-vacuumed. We should lower >> vacuum_freeze_min_age to something sane, like 1000000. > > I have to admit, I fail to see why this is a good idea. There > isn't much of an efficiency bonus in freezing early (due to hint > bits) and vacuums over vacuum_freeze_table_age are considerably > more expensive as they have to scan the whole heap instead of > using the visibilitymap. And if you don't vacuum the whole heap > you can't lower relfrozenxid. So changing freeze_min_age doesn't > help at all to avoid anti-wraparound vacuums. > > Am I missing something? IMO, anything which changes an anti-wraparound vacuum of a bulk-loaded table from "read the entire table and rewrite nearly the complete table with WAL-logging" to rewriting a smaller portion of the table with WAL-logging is an improvement. Anyone who has run an OLTP load on a database which was loaded from pg_dump output or other bulk load processes, has probably experienced the pain related to the WAL-logged rewrite of massive quantities of data. Of course, since it triggers based on transaction count, the higher your transaction rate at any moment, the more likely it is to kick in at that moment. Whatever we can do to reduce that pain helps. -Kevin
On 2013-01-28 08:15:29 -0800, Kevin Grittner wrote: > Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-01-29 00:11:12 +1100, Josh Berkus wrote: > >> > >>> So I think we need to sort by age(relfrozenxid) in tables that > >>> are over the anti-wraparound limit. Given your code that > >>> doesn't seem to be that hard? > >> > >> I might also suggest that we think about changing the defaults > >> for wraparound vacuum behavior. Partcularly, the fact that > >> vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by > >> default is optimal for absolutely nobody, and forces > >> re-wraparound vacuuming of wraparound tables which were just > >> recently wraparound-vacuumed. We should lower > >> vacuum_freeze_min_age to something sane, like 1000000. > > > > I have to admit, I fail to see why this is a good idea. There > > isn't much of an efficiency bonus in freezing early (due to hint > > bits) and vacuums over vacuum_freeze_table_age are considerably > > more expensive as they have to scan the whole heap instead of > > using the visibilitymap. And if you don't vacuum the whole heap > > you can't lower relfrozenxid. So changing freeze_min_age doesn't > > help at all to avoid anti-wraparound vacuums. > > > > Am I missing something? > > IMO, anything which changes an anti-wraparound vacuum of a > bulk-loaded table from "read the entire table and rewrite nearly > the complete table with WAL-logging" to rewriting a smaller portion > of the table with WAL-logging is an improvement. Yes, but the proposed changes make that *more* frequent, not less. 1mio transactions will usually be *after* the next checkpoint and thus be written twice. Imnsho the solution to actually solve that problem is to have a 'freeze map' that marks blocks (say 16MB) of tables as already frozen so they don't have to be reread/rewritten every time by (auto-)vacuum at all. I would like to work on that sometime in the future, but it won't be all that soon... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jan 27, 2013 at 2:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Fri, Jan 25, 2013 at 9:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think that to do this right, we need to consider not only the status >> quo but the trajectory. For example, suppose we have two tables to >> process, one of which needs a wraparound vacuum and the other one of >> which needs dead tuples removed. If the table needing the wraparound >> vacuum is small and just barely over the threshold, it isn't urgent; > > But it being small, it also won't take long to vacuum. Why not just do it? Because "big" and "small" are relative terms. >> but if it's large and way over the threshold, it's quite urgent. >> Similarly, if the table which needs dead tuples removed is rarely >> updated, postponing vacuum is not a big deal, but if it's being >> updated like crazy, postponing vacuum is a big problem. > > I don't see this as being the case. If it is being updated like > crazy, it doesn't matter whether it meets the threshold to have tuples > removed *right at the moment* or not. It will meet that threshold > soon. If you can't keep up with that need with your current settings, > you have a steady-state problem. Changing the order, or not changing > the order, isn't going to make a whole lot of difference, you need to > overcome the steady-state problem. Sure. There are many people for whom vacuum has no trouble at all keeping up, and others for whom it isn't even close to keeping up. People in the first category aren't likely to be damaged by the proposed change and people in the second category aren't likely to be helped. The issue is around what happens for people who are close to the edge. Will things get better or worse? Alvaro (and Simon) contend that there will be cases where full-cluster shutdowns that happen under today's algorithm would be avoided if we prioritize anti-wraparound vacuums over dead-tuple-cleanup vacuums. I believe that. I also believe that there will be cases where it goes the other way - where a bloat situation that remains under control with today's algorithm gets perturbed just enough by this change to cause runaway table bloat. Or at least, I contend that we don't have nearly enough evidence that that *won't* happen to risk back-patching a change of this type. In my experience, full-cluster shutdowns caused by autovacuum failing to advance datfrozenxid are extremely rare - and if they do happen, it's usually because the vacuum cost delay is set too high, or the cost limit too low. If we want to attack the problem of making sure such shutdowns don't happen, I'd argue that the most promising way to attack that problem is to progressively ratchet the delay down and the cost limit up as age(relfrozenxid) gets larger. On the other hand, problems with runaway table bloat are relatively common. Heikki's 8.4-era changes have of course helped quite a bit, but the problem is still very, very common. All you need is a series of "long"-running transactions (like a couple of *minutes* on a busy system), or a vacuum cost delay that is just ever-so-slightly too high, and you're completely hosed. I agree with you that if you've got a database that's well-tuned, so that you aren't skating on the ragged edge of disaster, this change probably won't break anything. But I am willing to bet that there are people out there who are, completely unknowingly, skating on that ragged edge. It is not as if we provide an easy way to know whether you've got the cost delay set optimally.
>> Categorically >> putting autovacuum wraparound tables ahead of everything else seems >> simplistic, and thinking that more dead tuples is more urgent than >> fewer dead tuples seems *extremely* simplistic. >> >> I ran across a real-world case where a user had a small table that had >> to be vacuumed every 15 seconds to prevent bloat. If we change the >> algorithm in a way that gives other things priority over that table, > > Eventually an anti-wrap around is going to be done, and once it starts > it does have priority, because things already underway don't get > preempted. Have they ever reached that point? Did it cause problems? In that specific case, I don't know. >> then that user could easily get hosed when they install a maintenance >> release containing this change. > > Yeah, I don't know that back-patching is a good idea, or at least not soon. That's all I'm arguing. I think it would be nice to do something for 9.3, preferably a little more sophisticated than just "put all anti-wraparound vacuums first". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jan 28, 2013 at 8:11 AM, Josh Berkus <josh@agliodbs.com> wrote: >> So I think we need to sort by age(relfrozenxid) in tables that are over >> the anti-wraparound limit. Given your code that doesn't seem to be that >> hard? > > I might also suggest that we think about changing the defaults for > wraparound vacuum behavior. Partcularly, the fact that > vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by default is > optimal for absolutely nobody, and forces re-wraparound vacuuming of > wraparound tables which were just recently wraparound-vacuumed. We > should lower vacuum_freeze_min_age to something sane, like 1000000. Currently, the defaults are 50m for vacuum_freeze_min_age, 150m for vacuum_freeze_table_age, and 200m for autovacuum_freeze_max_age, which doesn't seem like it matches what you're saying. vacuum_freeze_min_age doesn't force anti-wraparound vacuuming anyway; it just controls the tuple age at which opportunistic freezing starts. There's possibly some advantage to lowering it anyway, in the hopes of leaving less work for the eventual full-table scan to do, but dropping it to as low as 1m seems like it might be too much of a good thing - we might end up freezing a lot of tuples that would have gotten removed before they hit the 50m freeze threshold we have today. While we're talking about prioritizing vacuum, though, it would be nice to have a high-priority anti-wraparound vacuum and a low-priority anti-wraparound vacuum. When a table crosses the "low priority" threshold, we work on freezing it *if there is no other vacuuming that needs doing* - i.e. we're the only autovacuum process running, and no other table needs autovacuuming either for dead tuples or for wraparound. When it crosses the "high priority" threshold, then we behave as now, or perhaps even more aggressively (move to front of queue, raise or cost delay or ignore it outright, etc.). The "low priority" anti-wraparound vacuum would abort if some other vacuuming came to be required, if a lock conflict occurred, etc., and might also run with a higher cost delay. I believe this would tend to spread the anti-wraparound work out over a longer period of time, instead of clumping it all together as often happens today, and reduce the effect it has on foreground activities. It might not be going far enough in that direction but at least it would be a start. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
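A rough sketch of that two-threshold idea; the threshold parameters and the pending-work/worker checks are made up for illustration and are not existing GUCs or APIs:

#include <stdbool.h>

typedef enum
{
    AV_WRAP_NONE,   /* no wraparound work needed yet */
    AV_WRAP_LOW,    /* freeze opportunistically; back off on any conflict */
    AV_WRAP_HIGH    /* behave like today's anti-wraparound vacuum, or more aggressively */
} av_wrap_priority;

/* high_threshold corresponds to today's autovacuum_freeze_max_age behaviour. */
static av_wrap_priority
classify_wraparound_work(long frozenxid_age,
                         long low_threshold,
                         long high_threshold,
                         bool other_vacuum_work_pending,
                         int running_av_workers)
{
    if (frozenxid_age >= high_threshold)
        return AV_WRAP_HIGH;

    if (frozenxid_age >= low_threshold &&
        !other_vacuum_work_pending &&
        running_av_workers <= 1)        /* we'd be the only busy worker */
        return AV_WRAP_LOW;

    return AV_WRAP_NONE;
}

A low-priority pass would additionally run with a higher cost delay and give up on lock conflicts or when other vacuuming work shows up, as described above.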
> I have to admit, I fail to see why this is a good idea. There isn't much > of an efficiency bonus in freezing early (due to hint bits) and vacuums > over vacuum_freeze_table_age are considerably more expensive as they > have to scan the whole heap instead of using the visibilitymap. And if > you don't vacuum the whole heap you can't lower relfrozenxid. So > changing freeze_min_age doesn't help at all to avoid anti-wraparound > vacuums. > > Am I missing something?

Yep. First, you're confusing vacuum_freeze_table_age and vacuum_freeze_min_age. Second, you're not doing any arithmetic.

Let's do this by example. TableA is a large table which receives an almost constant stream of individual row updates, inserts, and deletes.

DEFAULTS:

XID 1: First rows in TableA are updated.
XID 200m: Anti-wraparound autovac of TableA.
All XIDs older than XID 100m set to FROZENXID.
XID 300m: Anti-wraparound autovac of TableA.
All XIDs older than XID 200M set to FROZENXID.
XID 400m: Anti-wraparound autovac of TableA.
All XIDs older than XID 300M set to FROZENXID.
XID 500m: Anti-wraparound autovac of TableA.
All XIDs older than XID 400M set to FROZENXID.
XID 600m: Anti-wraparound autovac of TableA.
All XIDs older than XID 500M set to FROZENXID.

vacuum_freeze_min_age = 1m:

XID 1: First rows in TableA are updated.
XID 200m: Anti-wraparound autovac of TableA.
All XIDs older than XID 199m set to FROZENXID.
XID 399m: Anti-wraparound autovac of TableA.
All XIDs older than XID 398M set to FROZENXID.
XID 598m: Anti-wraparound autovac of TableA.
All XIDs older than XID 597M set to FROZENXID.

vacuum_freeze_min_age = 1m, autovacuum_freeze_max_age = 500m:

XID 1: First rows in TableA are updated.
XID 500m: Anti-wraparound autovac of TableA.
All XIDs older than XID 499m set to FROZENXID.

As you can see, the current default settings cause 80% more wraparound autovacs per table than vacuum_freeze_min_age of 1m would, and almost 500% more than what I consider sane settings would. Just so that we can preserve XIDs which are 90m transactions old. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-01-29 16:09:52 +1100, Josh Berkus wrote: > > > I have to admit, I fail to see why this is a good idea. There isn't much > > of an efficiency bonus in freezing early (due to hint bits) and vacuums > > over vacuum_freeze_table_age are considerably more expensive as they > > have to scan the whole heap instead of using the visibilitymap. And if > > you don't vacuum the whole heap you can't lower relfrozenxid. So > > changing freeze_min_age doesn't help at all to avoid anti-wraparound > > vacuums. > > > > Am I missing something? > > Yep. First, you're confusing vacuum_freeze_table_age and > vacuum_freeze_min_age.

Don't think I did. I was talking about vacuum_freeze_table_age because that influences the number of full-table scans, in contrast to ones using the vm. That's an independent thing from anti-wraparound vacuums, which are triggered by autovacuum_freeze_max_age. The point I was trying to make is that a very big part of the load is not actually the freezing itself but the full-table vacuums which are triggered by freeze_table_age.

> Second, you're not doing any arithmatic.

Because it's not actually as easy to calculate as you make it seem. Even in the case of a large insert-only table you have way more complex behaviour than what you describe. The lifetime of tuples/buffers in that context is approx the following:
- inserted
- written by bgwriter or by checkpoint
- vacuum reads the non-all-visible part of the table
- vacuum sets HEAP_XMIN_COMMITTED
- freeze_table_age vacuum reads the whole table - doesn't find anything because of freeze_min_age
- freeze_table_age vacuum reads the whole table - freezes the tuple because >= freeze_min_age
- freeze_table_age vacuum reads the whole table - doesn't change anything in our page because it's already frozen

So the big point your computation is missing is that all those anti-wraparound vacuums:
a) might not even happen due to normal vacuums being over freeze_table_age which change the relfrozenxid
b) don't rewrite the whole table because the tuples actually are already frozen
c) will write out pages repeatedly because of tuples that get changed again
d) incur full page writes.

> Let's do this by example. TableA is a large table which receives an > almost constant stream of individual row updates, inserts, and deletes. > > DEFAULTS: > > XID 1: First rows in TableA are updated. > XID 200m: Anti-wraparound autovac of TableA. > All XIDs older than XID 100m set to FROZENXID.

Between those the table will have been vacuumed already, and depending on the schedule the tuples will already have been frozen due to freeze_min_age being 50mio and freeze_table_age being 150mio. Before that all the tuples will already have been written another time for hint bit writes.

> XID 300m: Anti-wraparound autovac of TableA > All XIDs older than XID 200M set to FROZENXID.

Only the newer tuples are going to be rewritten, the older parts of the table will only be read.

> XID 400m: Anti-wraparound autovac of TableA > All XIDs older than XID 300M set to FROZENXID. > XID 500m: Anti-wraparound autovac of TableA > All XIDs older than XID 400M set to FROZENXID. > XID 600m: Anti-wraparound autovac of TableA > All XIDs older than XID 500M set to FROZENXID. > vacuum_freeze_min_age = 1m > > XID 1: First rows in TableA are updated. > XID 200m: Anti-wraparound autovac of TableA. > All XIDs older than XID 199m set to FROZENXID.
Even in an insert-only case the tuples will be written at least twice before an anti-freeze-wraparound, often thrice: - first checkpoint - hint bit sets due to a normal vacuum - frozen due to a full-table vacuum But, as you assumed the table will also get deletes and updates the low freeze age will mean that some tuples on a page will get frozen on each vacuum that reads the page which incurs a full-page-write everytime the some tuples are frozen as most of the time the last time the page was touched will be before the last checkpoint happened. As the WAL is a major bottleneck on a write-heavy server that can incur a pretty heft global slowdown. Its *good* to only freeze tuples once youre pretty damn sure it won't be touched by actual data changes again. As full-table vacuums happen more frequently than anti-freeze vacuums anyway the cost of actual anti-freeze vacuums, should they happen because of a too busy autovacuum, aren't a problem in itself. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-01-29 16:09:52 +1100, Josh Berkus wrote: >> >>> I have to admit, I fail to see why this is a good idea. There >>> isn't much of an efficiency bonus in freezing early (due to >>> hint bits) and vacuums over vacuum_freeze_table_age are >>> considerably more expensive as they have to scan the whole heap >>> instead of using the visibilitymap. And if you don't vacuum the >>> whole heap you can't lower relfrozenxid. So changing >>> freeze_min_age doesn't help at all to avoid anti-wraparound >>> vacuums. >>> >>> Am I missing something? >> >> Yep. First, you're confusing vacuum_freeze_table_age and >> vacuum_freeze_min_age. > > Don't think I did. I was talking about vacuum_freeze_table_age > because that influences the amount of full-table scans Not any more than vacuum_freeze_min_age does. http://www.postgresql.org/docs/9.2/interactive/routine-vacuuming.html#AUTOVACUUM | a whole table sweep is forced if the table hasn't been fully | scanned for vacuum_freeze_table_age minus vacuum_freeze_min_age | transactions. So reducing vacuum_freeze_min_age not only helps minimize the writes that are needed when autovacuum needs to scan the entire heap, but also decreases the frequency of those full-table scans. -Kevin
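(For reference, the three settings under discussion can be inspected on any given system with:)

    SELECT name, setting, source
    FROM pg_settings
    WHERE name IN ('vacuum_freeze_min_age',
                   'vacuum_freeze_table_age',
                   'autovacuum_freeze_max_age');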
On 2013-01-30 05:39:29 -0800, Kevin Grittner wrote: > Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-01-29 16:09:52 +1100, Josh Berkus wrote: > >> > >>> I have to admit, I fail to see why this is a good idea. There > >>> isn't much of an efficiency bonus in freezing early (due to > >>> hint bits) and vacuums over vacuum_freeze_table_age are > >>> considerably more expensive as they have to scan the whole heap > >>> instead of using the visibilitymap. And if you don't vacuum the > >>> whole heap you can't lower relfrozenxid. So changing > >>> freeze_min_age doesn't help at all to avoid anti-wraparound > >>> vacuums. > >>> > >>> Am I missing something? > >> > >> Yep. First, you're confusing vacuum_freeze_table_age and > >> vacuum_freeze_min_age. > > > > Don't think I did. I was talking about vacuum_freeze_table_age > > because that influences the amount of full-table scans > > Not any more than vacuum_freeze_min_age does. Well, vacuum_freeze_min_age is 50m while vacuum_freeze_table_age is 150m. > http://www.postgresql.org/docs/9.2/interactive/routine-vacuuming.html#AUTOVACUUM > > | a whole table sweep is forced if the table hasn't been fully > | scanned for vacuum_freeze_table_age minus vacuum_freeze_min_age > | transactions. > > So reducing vacuum_freeze_min_age not only helps minimize the > writes that are needed when autovacuum needs to scan the entire > heap, but also decreases the frequency of those full-table scans. But it increases the number of pages that are written out multiple times because they contain tuples of different ages, in contrast to increasing vacuum_freeze_table_age, which doesn't have that problem. In combination with full_page_writes that makes a noticeable difference in total write volume. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-30 14:58:24 +0100, Andres Freund wrote: > > So reducing vacuum_freeze_min_age not only helps minimize the > > writes that are needed when autovacuum needs to scan the entire > > heap, but also decreases the frequency of those full-table scans. > > But it increases the number of pages that are written out multiple times > because they contain tuples of different ages, in contrast to increasing > vacuum_freeze_table_age which doesn't have that problem. In combination > with full_page_writes that makes a noticeable difference in total write > volume. Btw, as far as I read the code that behaviour only exists insofar as the last time vacuum runs it freezes all tuples below freeze_min_age but not newer ones, so relfrozenxid will only be set to current_xmin - freeze_min_age. But if you manually freeze or no such old tuples exist, it's solely influenced by freeze_table_age. The relevant parts of the code are:

c.f. vacuum_set_xid_limits:

    /*
     * Determine the table freeze age to use: as specified by the caller,
     * or vacuum_freeze_table_age, but in any case not more than
     * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
     * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
     * before anti-wraparound autovacuum is launched.
     */
    freezetable = freeze_min_age;
    if (freezetable < 0)
        freezetable = vacuum_freeze_table_age;
    freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
    Assert(freezetable >= 0);

    /*
     * Compute the cutoff XID, being careful not to generate a "permanent"
     * XID.
     */
    limit = ReadNewTransactionId() - freezetable;
    if (!TransactionIdIsNormal(limit))
        limit = FirstNormalTransactionId;

    *freezeTableLimit = limit;

lazy_vacuum_rel:

    scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
                                             freezeTableLimit);

If you're careful you can also notice that there is an interesting typo in the freeze table computation: namely, it uses freeze_min_age instead of freeze_table_age. Which probably explains why I had such bad performance results with lowering vacuum_freeze_min_age; it basically radically increases the amount of full-table scans, far more than it should. I can't imagine that anybody with a large database ran pg successfully with a small freeze_min_age due to this. It seems to be broken since the initial introduction of freeze_table_age in 6587818542e79012276dcfedb2f97e3522ee5e9b. I guess it wasn't noticed because the behaviour is only visible via autovacuum, since a user-issued VACUUM passes -1 as freeze_min_age. Trivial patch attached. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
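(As a purely illustrative aside, not part of the patch: the practical effect of that typo can be approximated from SQL. The query below compares which tables an autovacuum under default settings would promote to a full-heap scan using the buggy threshold, vacuum_freeze_min_age, versus the intended one, vacuum_freeze_table_age, both capped at 95% of autovacuum_freeze_max_age as in vacuum_set_xid_limits. Per-table reloptions are ignored.)

    SELECT c.oid::regclass AS table_name,
           age(c.relfrozenxid) AS xid_age,
           age(c.relfrozenxid) >= least(
               current_setting('vacuum_freeze_min_age')::bigint,
               current_setting('autovacuum_freeze_max_age')::bigint * 95 / 100)
               AS full_scan_with_bug,
           age(c.relfrozenxid) >= least(
               current_setting('vacuum_freeze_table_age')::bigint,
               current_setting('autovacuum_freeze_max_age')::bigint * 95 / 100)
               AS full_scan_as_intended
    FROM pg_class c
    WHERE c.relkind = 'r'
    ORDER BY age(c.relfrozenxid) DESC;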
Attachment
On Mon, Jan 28, 2013 at 9:09 PM, Josh Berkus <josh@agliodbs.com> wrote: > > Let's do this by example. TableA is a large table which receives an > almost constant stream of individual row updates, inserts, and deletes. > > DEFAULTS: > > XID 1: First rows in TableA are updated. > XID 200m: Anti-wraparound autovac of TableA. > All XIDs older than XID 100m set to FROZENXID. > XID 300m: Anti-wraparound autovac of TableA > All XIDs older than XID 200M set to FROZENXID. > XID 400m: Anti-wraparound autovac of TableA > All XIDs older than XID 300M set to FROZENXID. > XID 500m: Anti-wraparound autovac of TableA > All XIDs older than XID 400M set to FROZENXID. > XID 600m: Anti-wraparound autovac of TableA > All XIDs older than XID 500M set to FROZENXID. You seem to be assuming the only reason for vacuums to occur on this table is autovacuum_freeze_max_age, which doesn't seem likely to be the case for a busy table. > vacuum_freeze_min_age = 1m > > XID 1: First rows in TableA are updated. > XID 200m: Anti-wraparound autovac of TableA. > All XIDs older than XID 199m set to FROZENXID. > XID 399m: Anti-wraparound autovac of TableA > All XIDs older than XID 398M set to FROZENXID. > XID 598m: Anti-wraparound autovac of TableA > All XIDs older than XID 597M set to FROZENXID. > > > vacuum_freeze_min_age = 1m, autovacuum_freeze_max_age = 500m > > XID 1: First rows in TableA are updated. > XID 500m: Anti-wraparound autovac of TableA. > All XIDs older than XID 499m set to FROZENXID. > > As you can see, the current default settings cause 80% more wraparound > autovacs per table than vacuum_freeze_min_age of 1m would, and almost > 500% more than what I consider sane settings would. You can get nearly all the benefits of your "sane" settings just by increasing autovacuum_freeze_max_age and leaving vacuum_freeze_min_age alone. (Assuming the table doesn't get vacuumed for other reasons) Cheers, Jeff
On Wed, Jan 30, 2013 at 5:39 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > Andres Freund <andres@2ndquadrant.com> wrote: >> >> Don't think I did. I was talking about vacuum_freeze_table_age >> because that influences the amount of full-table scans > > Not any more than vacuum_freeze_min_age does. There is a lot more room for vacuum_freeze_table_age to be increased than there is for vacuum_freeze_min_age to be decreased. > http://www.postgresql.org/docs/9.2/interactive/routine-vacuuming.html#AUTOVACUUM > > | a whole table sweep is forced if the table hasn't been fully > | scanned for vacuum_freeze_table_age minus vacuum_freeze_min_age > | transactions. Why is vacuuming described as such a violent process? It doesn't "force" a table sweep, it just goes ahead and performs one. In general, it seems hard to tell from the docs that this only promotes a vacuum which is going to happen anyway from a vm one to a full scan one. The forcefulness makes it sound more like it is doing vacuums that wouldn't happen otherwise (like autovacuum_freeze_max_age does, which actually could be described as "force" since it turns on the autovac launcher even if it is configured to be off). > So reducing vacuum_freeze_min_age not only helps minimize the > writes that are needed when autovacuum needs to scan the entire > heap, How does it do that? If the tuple doesn't need to be frozen now because it was already frozen, that just means the write happened at a different time but it still happened. Cheers, Jeff
Andres Freund <andres@2ndquadrant.com> wrote: > I can't imagine that anybody with a large database ran pg > successfully with a small freeze_min_age due to this. I can't speak to this from personal experience, because at Wisconsin Courts we found ourselves best served by running a database VACUUM FREEZE ANALYZE each night during off-peak hours. > It seems to be broken since the initial introduction of > freeze_table_age in 6587818542e79012276dcfedb2f97e3522ee5e9b. > Trivial patch attached. I didn't see a patch attached. -Kevin
On 2013-01-30 10:21:07 -0800, Kevin Grittner wrote: > > It seems to be broken since the initial introduction of > > freeze_table_age in 6587818542e79012276dcfedb2f97e3522ee5e9b. > > > Trivial patch attached. > > I didn't see a patch attached. The archive has it, so I for once haven't forgotten sending it: http://www.postgresql.org/message-id/20130130145521.GB3355@awork2.anarazel.de While it's a clear bug and fixing it in HEAD is a sure thing, an argument could be made that it's a clear behavioural change in the back branches. I don't think it holds too much water, but wanted to mention it. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Kevin Grittner <kgrittn@ymail.com> wrote: > To: Andres Freund <andres@2ndquadrant.com> >> Trivial patch attached. > > I didn't see a patch attached. Never mind; I was looking in the wrong spot. (I just switched email providers again because the last one couldn't seem to get the email headers right for threading purposes.) -Kevin
Jeff Janes <jeff.janes@gmail.com> wrote: > Kevin Grittner <kgrittn@ymail.com> wrote: >> So reducing vacuum_freeze_min_age not only helps minimize the >> writes that are needed when autovacuum needs to scan the entire >> heap, > > How does it do that? If the tuple doesn't need to frozen now > because it was already frozen, that just means the write happened > at a different time but it still happened. There's a lot to be said for nibbling away at it during VM autovacuum runs versus doing it in big chunks in heap-scan runs, particularly if your BBU cache can absorb up to a certain amount with no real pain, but causes major pain if the write cache fills. -Kevin
On 1/25/13 11:56 AM, Christopher Browne wrote: > With a little bit of noodling around, here's a thought for a joint function > that I*think* has reasonably common scales: > > f(deadtuples, relpages, age) = > deadtuples/relpages + e ^ (age*ln(relpages)/2^32) Be careful with dead/relpages, because dead tuples increase relpages as well. The effect is extremely noticeable on frequently hit tables that need to be kept small. If you want to have a deadtuples/size metric, I think it would be far better to do deadtuples/non_bloated_table_size. Someone else in the thread mentioned that what we really need to be watching aren't raw values, but trends. Or you can think of it as watching first (or even second) derivatives if you like. I couldn't agree more. I believe there are several parts of Postgres that end up with a bunch of hard to tune GUCs specifically because we're measuring the wrong things. Take freezing for example. Since the only reason to freeze is XID wrap then the *ideal* time to start a freeze vacuum on a table is so that the vacuum would end *exactly* as we were about to hit XID wrap. Obviously that's a completely impractical goal to hit, but notice the simplicity of the goal: we only care about the vacuum ending right before we'd hit XID wrap. The only way to do that is to monitor how fast vacuums are running, how fast XIDs are being consumed, and how quickly the oldest XID in each table is advancing. Notice that all of those measurements are time derivatives. From a more practical standpoint, I think it would be extremely useful to have a metric that showed how quickly a table churned. Something like dead tuples per time period. Comparing that to the non-bloated table size should give a very strong indication of how critical frequent vacuums on that table are. I don't have a good metric in mind for freeze right now, but I do want to mention a use case that I don't think has come up before. When building a londiste slave (and presumably all the other trigger replication systems suffer from this), each table is copied over in a single transaction, and then updates start flowing in for that table. That can easily result in a scenario where you have an enormous volume of tuples that will all need freezing at almost exactly the same time. It would be nice if we could detect such a condition and freeze those tuples over time, instead of trying to freeze all of them in one shot.
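(As a rough sketch of what such a churn metric could be built from today, using only the statistics that already exist: the query below shows an instantaneous ratio of dead tuples to estimated live size, not the dead-tuples-per-time-period rate being asked for, which would need two samples of n_dead_tup taken some interval apart.)

    SELECT s.relname,
           s.n_dead_tup,
           c.reltuples::bigint AS est_live_tuples,
           s.n_dead_tup / greatest(c.reltuples, 1)::float8 AS dead_per_live
    FROM pg_stat_user_tables s
    JOIN pg_class c ON c.oid = s.relid
    ORDER BY dead_per_live DESC;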
Jim Nasby <jim@nasby.net> wrote: > the only reason to freeze is XID wrap The conclusions you draw seem to be based on a slightly different premise than stated here; the conclusions are only supported by the assumption that "the only reason to freeze at any particular moment is to assure that all tuples are frozen in time to prevent transaction wraparound." In a mostly write-only database, that has to happen for nearly every tuple at some point -- but *which* point is best is not necessarily "as late as possible". > then the *ideal* time to start a freeze vacuum on a table is so > that the vacuum would end *exactly* as we were about to hit XID > wrap. For a tuple which you know is going to survive long enough to be frozen, the *ideal* time to freeze a tuple, from a performance standpoint, is when the tuple is written, to eliminate a separate WAL-log operation. The *next best* time to freeze is when the hint bits are set, to avoid a separate page write. If you are doing differential backups, the *third best* time to freeze is before the first differential backup of the tuple, to avoid a separate backup after the freeze. And so on. -Kevin
On Wed, Jan 30, 2013 at 4:05 PM, Jim Nasby <jim@nasby.net> wrote: > On 1/25/13 11:56 AM, Christopher Browne wrote: >> >> With a little bit of noodling around, here's a thought for a joint >> function >> that I*think* has reasonably common scales: >> >> >> f(deadtuples, relpages, age) = >> deadtuples/relpages + e ^ (age*ln(relpages)/2^32) > > > Be careful with dead/relpages, because dead tuples increase relpages as > well. The effect is extremely noticeable on frequently hit tables that need > to be kept small. If you want to have a deadtuples/size metric, I think it > would be far better to do deadtuples/non_bloated_table_size. You may well be right about that; I didn't try to get to second order effects, just to come up with something that seemed reasonably readily computable based on existing data that seemed relevant. One approach to things is to ask if we can "tweak" that model to be materially better without requiring extra data collection. It seems plausible to do so; we might get something better by deducting dead tuples. I'll not try to present this yet; I think it's a better idea to come up with a broadly fuller model. > Someone else in the thread mentioned that what we really need to be watching > aren't raw values, but trends. Or you can think of it as watching first (or > even second) derivatives if you like. I couldn't agree more. I believe there > are several parts of Postgres that end up with a bunch of hard to tune GUCs > specifically because we're measuring the wrong things. My "first order" reaction is that this sounds great. And I reckon it would be a great idea to come up with a reasonably sophisticated model that covers all of the behaviours that we see emerging. Mind you, I'm not sure that we can actually *use* that model; it may require collecting enough more data that it becomes burdensome to collect the additional data. > Take freezing for example. Since the only reason to freeze is XID wrap then > the *ideal* time to start a freeze vacuum on a table is so that the vacuum > would end *exactly* as we were about to hit XID wrap. > > Obviously that's a completely impractical goal to hit, but notice the > simplicity of the goal: we only care about the vacuum ending right before > we'd hit XID wrap. The only way to do that is to monitor how fast vacuums > are running, how fast XIDs are being consumed, and how quickly the oldest > XID in each table is advancing. Notice that all of those measurements are > time derivatives. Yep, and I think some second derivatives will prove useful too. The function that I presented took a slightly different tack to things; the exponential term would try to make sure that tables get frozen WELL in advance of necessity. That strikes me as being safer in cases where a database is blowing through XIDs mighty quickly. > From a more practical standpoint, I think it would be extremely useful to > have a metric that showed how quickly a table churned. Something like dead > tuples per time period. Comparing that to the non-bloated table size should > give a very strong indication of how critical frequent vacuums on that table > are. That's a nice idea, though I'm pretty sure that this requires capturing and storing a lot more statistical data, which presumably worsens things (presuming the data gets stored in tables), making the database bigger, adding an extra set of data that needs to get collected, queried, and trimmed, and chewing through still more XIDs. 
If you could demonstrate pretty forcibly improved behaviours falling out of that, that may make it salable. There's quite a bit of work there, and we would need to accept a burden of still more stats collection. > I don't have a good metric in mind for freeze right now, but I do want to > mention a use case that I don't think has come up before. When building a > londiste slave (and presumably all the other trigger replication systems > suffer from this), each table is copied over in a single transaction, and > then updates start flowing in for that table. That can easily result in a > scenario where you have an enormous volume of tuples that will all need > freezing at almost exactly the same time. It would be nice if we could > detect such a condition and freeze those tuples over time, instead of trying > to freeze all of them in one shot. I wonder if these are perhaps good candidates for being frozen immediately; COPY FREEZE was added in not so long ago; it doesn't perfectly cover this, but if I squint at it a bit... http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8de72b66a2edcf12c812de0a73bd50b6b7d81d62 I don't see a lot of value to having the data COPYed over to a subscriber NOT getting frozen immediately. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
On 1/30/13 3:28 PM, Kevin Grittner wrote: > Jim Nasby <jim@nasby.net> wrote: >> then the *ideal* time to start a freeze vacuum on a table is so >> that the vacuum would end *exactly* as we were about to hit XID >> wrap. > > For a tuple which you know is going to survive long enough to be > frozen, the *ideal* time to freeze a tuple, from a performance > standpoint, is when the tuple is written, to eliminate a separate > WAL-log operation. The *next best* time to freeze is when the hint > bits are set, to avoid a separate page write. If you are doing > differential backups, the *third best* time to freeze is before the > first differential backup of the tuple, to avoid a separate backup > after the freeze. And so on. Yeah, I hadn't considered that aspect, but I think that's a separate discussion. I agree that there are many cases where we would benefit from freezing early, and those should all help reduce the impact of XID-wrap induced freezing. Even if we do introduce "up-front" freezing, we still need to deal with all the other cases though, which in the worst case means avoiding the XID wrap. I believe that particular case (as well as the whole autovac priority question) would be best served by looking at 1st and 2nd order derivatives.
On 1/30/13 4:37 PM, Christopher Browne wrote: >> From a more practical standpoint, I think it would be extremely useful to >> >have a metric that showed how quickly a table churned. Something like dead >> >tuples per time period. Comparing that to the non-bloated table size should >> >give a very strong indication of how critical frequent vacuums on that table >> >are. > That's a nice idea, though I'm pretty sure that this requires capturing and > storing a lot more statistical data, which presumably worsens things > (presuming the data gets stored in tables), making the database bigger, > adding an extra set of data that needs to get collected, queried, and > trimmed, and chewing through still more XIDs. We don't necessarily need added storage (or at least nothing serious). It could potentially be as simple as remembering when we last took a measurement and using that to also store number of blahs per second. That's only 2 added data points (time and the rate). Of course, there are pitfalls to something that simple, though some of those pitfalls have simple solutions (such as using estimated weighted means). In this case I think we would be better served by trying to define a good metric first, and then worrying about how to actually capture and store it. >> >I don't have a good metric in mind for freeze right now, but I do want to >> >mention a use case that I don't think has come up before. When building a >> >londiste slave (and presumably all the other trigger replication systems >> >suffer from this), each table is copied over in a single transaction, and >> >then updates start flowing in for that table. That can easily result in a >> >scenario where you have an enormous volume of tuples that will all need >> >freezing at almost exactly the same time. It would be nice if we could >> >detect such a condition and freeze those tuples over time, instead of trying >> >to freeze all of them in one shot. > I wonder if these are perhaps good candidates for being frozen immediately; > COPY FREEZE was added in not so long ago; it doesn't perfectly cover > this, but if I squint at it a bit... > > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8de72b66a2edcf12c812de0a73bd50b6b7d81d62 > > I don't see a lot of value to having the data COPYed over to a subscriber > NOT getting frozen immediately. Yeah, COPY FREEZE would probably be highly valuable in this case.
> You can get nearly all the benefits of your "sane" settings just by > increasing autovacuum_freeze_max_age and leaving vacuum_freeze_min_age > alone. (Assuming the table doesn't get vacuumed for other reasons) Correct, it's the ratio that matters. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Christopher Browne escribió: > On Fri, Jan 25, 2013 at 12:00 PM, Andres Freund <andres@2ndquadrant.com> wrote: > I'd be inclined to do something a bit more sophisticated than just > age(relfrozenxid) for wraparound; I'd be inclined to kick off large tables' > wraparound vacuums earlier than those for smaller tables. > > With a little bit of noodling around, here's a thought for a joint function > that I *think* has reasonably common scales: > > f(deadtuples, relpages, age) = > deadtuples/relpages + e ^ (age*ln(relpages)/2^32) Okay, here's a patch along these lines. I haven't considered Jim's suggestion downthread about discounting dead tuples from relpages; maybe we can do that by subtracting the pages attributed to dead ones, estimating via tuple density (reltuples/relpages). But that's not my main concern here. Instead, what I propose (and is not really in the patch), as a backpatchable item, is an approach in which the functions to compute each rel's Browne strength and sort are hooks. Normal behavior is not to sort at all, as currently, and sites that have a problem with the current random order can install a custom module that provides hooks to change ordering as they see fit. So behavior won't change for people who have no problem today. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
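(For anyone who wants to see how that ordering would rank the tables of an existing database, the quoted formula can be evaluated from SQL as well. This is illustrative only; the patch computes it in C inside autovacuum, and greatest() is used here merely to avoid ln(0) on empty tables.)

    SELECT c.oid::regclass AS table_name,
           s.n_dead_tup / greatest(c.relpages, 1)::float8
             + exp(age(c.relfrozenxid) * ln(greatest(c.relpages, 1)) / 2^32)
               AS browne_strength
    FROM pg_class c
    JOIN pg_stat_user_tables s ON s.relid = c.oid
    WHERE c.relkind = 'r'
    ORDER BY browne_strength DESC;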
Alvaro Herrera escribió: > Okay, here's a patch along these lines. I haven't considered Jim's > suggestion downthread about discounting dead tuples from relpages; maybe > we can do that by subtracting the pages attributed to dead ones, > estimating via tuple density (reltuples/relpages). Patch attached. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Jan 31, 2013 at 2:40 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Alvaro Herrera escribió: > >> Okay, here's a patch along these lines. I haven't considered Jim's >> suggestion downthread about discounting dead tuples from relpages; maybe >> we can do that by subtracting the pages attributed to dead ones, >> estimating via tuple density (reltuples/relpages). > > Patch attached. This strikes me as too clever by half. You've introduced the concept of a "Browne strength" (apparently named for Christopher Browne) and yet you haven't even bothered to add a comment explaining the meaning of the term, let alone justifying the choice of that formula rather than any other. I don't want to dog this proposal to death, because surely we can do better than the status quo here, but adopting the first formula someone proposed without any analysis of whether it does the right thing cannot possibly be the correct decision process. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 31, 2013 at 2:36 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Instead, what I propose (and is not really in the patch), as a > backpatchable item, is an approach in which the functions to compute > each rel's Browne strength and sort are hooks. Normal behavior is not > to sort at all, as currently, and sites that have a problem with the > current random order can install a custom module that provide hooks to > change ordering as they see fit. So behavior won't change for people > who have no problem today. Can you think of any examples of cases where we have back-patched a new hook? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Instead, what I propose (and is not really in the patch), as a > backpatchable item, is an approach in which the functions to compute > each rel's Browne strength and sort are hooks. Normal behavior is not > to sort at all, as currently, and sites that have a problem with the > current random order can install a custom module that provide hooks to > change ordering as they see fit. So behavior won't change for people > who have no problem today. Meh. I'm not really thrilled with adding hooks (that presumably we'd have to preserve forever) to solve a short-term problem. Nor does this sound hugely convenient for users with the problem, anyway. Do we even know for sure that anyone would create such modules? I think we should just fix it as best we can in HEAD, and then anyone who thinks the risk/reward ratio is favorable can back-patch that fix into a private build. regards, tom lane
Tom Lane escribió: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Instead, what I propose (and is not really in the patch), as a > > backpatchable item, is an approach in which the functions to compute > > each rel's Browne strength and sort are hooks. Normal behavior is not > > to sort at all, as currently, and sites that have a problem with the > > current random order can install a custom module that provide hooks to > > change ordering as they see fit. So behavior won't change for people > > who have no problem today. > > Meh. I'm not really thrilled with adding hooks (that presumably we'd > have to preserve forever) to solve a short-term problem. Nor does this > sound hugely convenient for users with the problem, anyway. Do we even > know for sure that anyone would create such modules? Well, I would. Providing a custom module is many times more convenient than providing a patched binary. But since there seems to be considerable resistance to the idea I will drop it, unless others vote in favour. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas escribió: > On Thu, Jan 31, 2013 at 2:40 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Alvaro Herrera escribió: > > > >> Okay, here's a patch along these lines. I haven't considered Jim's > >> suggestion downthread about discounting dead tuples from relpages; maybe > >> we can do that by subtracting the pages attributed to dead ones, > >> estimating via tuple density (reltuples/relpages). > > > > Patch attached. > > This strikes me as too clever by half. You've introduced the concept > of a "Browne strength" (apparently named for Christopher Browne) and > yet you haven't even bothered to add a comment explaining the meaning > of the term, let along justifying the choice of that formula rather > than any other. I don't want to dog this proposal to death, because > surely we can do better than the status quo here, but adopting the > first formula someone proposed without any analysis of whether it does > the right thing cannot possibly be the correct decision process. My intention was to apply a Nasby correction to Browne Strength and call the resulting function Browne' (Browne prime). Does that sound better? Now seriously, I did experiment a bit with this and it seems to behave reasonably. Of course, there might be problems with it, and I don't oppose to changing the name. "Vacuum strength" didn't sound so great, so I picked the first term that came to mind. It's not like picking people's last names to name stuff is a completely new idea; that said, it was sort of a joke. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 31, 2013 at 3:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas escribió: >> On Thu, Jan 31, 2013 at 2:40 PM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >> > Alvaro Herrera escribió: >> > >> >> Okay, here's a patch along these lines. I haven't considered Jim's >> >> suggestion downthread about discounting dead tuples from relpages; maybe >> >> we can do that by subtracting the pages attributed to dead ones, >> >> estimating via tuple density (reltuples/relpages). >> > >> > Patch attached. >> >> This strikes me as too clever by half. You've introduced the concept >> of a "Browne strength" (apparently named for Christopher Browne) and >> yet you haven't even bothered to add a comment explaining the meaning >> of the term, let along justifying the choice of that formula rather >> than any other. I don't want to dog this proposal to death, because >> surely we can do better than the status quo here, but adopting the >> first formula someone proposed without any analysis of whether it does >> the right thing cannot possibly be the correct decision process. > > My intention was to apply a Nasby correction to Browne Strength and call > the resulting function Browne' (Browne prime). Does that sound better? > > Now seriously, I did experiment a bit with this and it seems to behave > reasonably. Of course, there might be problems with it, and I don't > oppose to changing the name. "Vacuum strength" didn't sound so great, > so I picked the first term that came to mind. It's not like picking > people's last names to name stuff is a completely new idea; that said, > it was sort of a joke. Color me amused :-). And, when thinking about how strong these things are, just remember, "smell isn't everything". I spent 20 minutes at a whiteboard arriving at the "Browne strength", and I think it's not unreasonable as a usage of the data already immediately at hand. But it is absolutely just intended as a strawman proposal, and I'd be pleased to see it get prodded into something more "prime." -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
Andres Freund wrote: > If youre careful you can also notice that there is an interesting typo > in the freeze table computation. Namely it uses freeze_min_age instead > of freeze_table_age. Which probably explains why I had so bad > performance results with lowering vacuum_freeze_min_age, it basically > radically increases the amount of full-table-scans, far more than it > should. > > I can't imagine that anybody with a large database ran pg successfully > with a small freeze_min_age due to this. > > It seems to be broken since the initial introduction of freeze_table_age > in 6587818542e79012276dcfedb2f97e3522ee5e9b. I guess it wasn't noticed > because the behaviour is only visible via autovacuum because a > user-issued VACUUM passes -1 as freeze_min_age. Backpatched all the way back to 8.4 -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 31, 2013 at 3:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > My intention was to apply a Nasby correction to Browne Strength and call > the resulting function Browne' (Browne prime). Does that sound better? /me rests head in hands. I'm not halfway clever enough to hang with this crowd; I'm not even going to touch the puns in Chris' reply. > Now seriously, I did experiment a bit with this and it seems to behave > reasonably. Of course, there might be problems with it, and I don't > oppose to changing the name. "Vacuum strength" didn't sound so great, > so I picked the first term that came to mind. It's not like picking > people's last names to name stuff is a completely new idea; that said, > it was sort of a joke. I don't think I really understand the origin of the formula, so perhaps if someone would try to characterize why it seems to behave reasonably that would be helpful (at least to me). > f(deadtuples, relpages, age) = > deadtuples/relpages + e ^ (age*ln(relpages)/2^32) To maybe make that discussion go more quickly let me kvetch about a few things to kick things off:
- Using deadtuples/relpages as part of the formula means that tables with smaller tuples (thus more tuples per page) will tend to get vacuumed before tables with larger tuples (thus less tuples per page). I can't immediately see why that's a good thing.
- It's probably important to have a formula where we can be sure that the wrap-around term will eventually dominate the dead-tuple term, with enough time to spare to make sure nothing really bad happens; on the other hand, it's also desirable to avoid the case where a table that has just crossed the threshold for wraparound vacuuming doesn't immediately shoot to the top of the list even if it isn't truly urgent. It's unclear to me just from looking at this formula how well the second term meets those goals.
- More generally, it seems to me that we ought to be trying to think about the units in which these various quantities are measured. Each term ought to be unit-less. So perhaps the first term ought to divide dead tuples by total tuples, which has the nice property that the result is a dimensionless quantity that never exceeds 1.0. Then the second term can be scaled somehow based on that value.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
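(As a sketch only: the last point above, dividing dead tuples by total tuples so the first term stays between 0 and 1, would look like this in SQL. The scaling of the second term is deliberately left untouched, since that is exactly the open question.)

    SELECT table_name, dead_fraction + wraparound_term AS priority
    FROM (SELECT c.oid::regclass AS table_name,
                 s.n_dead_tup / greatest(c.reltuples, 1)::float8 AS dead_fraction,
                 exp(age(c.relfrozenxid) * ln(greatest(c.relpages, 1)) / 2^32)
                     AS wraparound_term
          FROM pg_class c
          JOIN pg_stat_user_tables s ON s.relid = c.oid
          WHERE c.relkind = 'r') AS t
    ORDER BY priority DESC;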
On Wed, Jan 30, 2013 at 6:55 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > c.f. > vacuum_set_xid_limits: > /* > * Determine the table freeze age to use: as specified by the caller, > * or vacuum_freeze_table_age, but in any case not more than > * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly > * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples > * before anti-wraparound autovacuum is launched. > */ > freezetable = freeze_min_age; > if (freezetable < 0) > freezetable = vacuum_freeze_table_age; > freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95); > Assert(freezetable >= 0); > > /* > * Compute the cutoff XID, being careful not to generate a "permanent" > * XID. > */ > limit = ReadNewTransactionId() - freezetable; > if (!TransactionIdIsNormal(limit)) > limit = FirstNormalTransactionId; > > *freezeTableLimit = limit; > > lazy_vacuum_rel: > scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid, > freezeTableLimit); > > If youre careful you can also notice that there is an interesting typo > in the freeze table computation. Namely it uses freeze_min_age instead > of freeze_table_age. Which probably explains why I had so bad > performance results with lowering vacuum_freeze_min_age, it basically > radically increases the amount of full-table-scans, far more than it > should. > > I can't imagine that anybody with a large database ran pg successfully > with a small freeze_min_age due to this. As far as I can tell this bug kicks in when your cluster gets to be older than freeze_min_age, and then lasts forever after. After that point pretty much every auto-vacuum inspired by update/deletion activity will get promoted to a full table scan. (Which makes me wonder how much field-testing the vm-only vacuum has received, if it was rarely happening in practice due to this bug.) Lowering the setting of freeze_min_age does not make the bug worse, it only makes it manifest earlier in the lifetime of the database. Cheers, Jeff
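(Whether a given cluster has already crossed that point can be checked with a query along these lines; illustrative only:)

    SELECT datname,
           age(datfrozenxid) AS datfrozenxid_age,
           age(datfrozenxid) > current_setting('vacuum_freeze_min_age')::int
               AS older_than_freeze_min_age
    FROM pg_database
    ORDER BY age(datfrozenxid) DESC;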
On 2013-02-01 14:05:46 -0800, Jeff Janes wrote: > On Wed, Jan 30, 2013 at 6:55 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > > > c.f. > > vacuum_set_xid_limits: > > /* > > * Determine the table freeze age to use: as specified by the caller, > > * or vacuum_freeze_table_age, but in any case not more than > > * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly > > * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples > > * before anti-wraparound autovacuum is launched. > > */ > > freezetable = freeze_min_age; > > if (freezetable < 0) > > freezetable = vacuum_freeze_table_age; > > freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95); > > Assert(freezetable >= 0); > > > > /* > > * Compute the cutoff XID, being careful not to generate a "permanent" > > * XID. > > */ > > limit = ReadNewTransactionId() - freezetable; > > if (!TransactionIdIsNormal(limit)) > > limit = FirstNormalTransactionId; > > > > *freezeTableLimit = limit; > > > > lazy_vacuum_rel: > > scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid, > > freezeTableLimit); > > > > If youre careful you can also notice that there is an interesting typo > > in the freeze table computation. Namely it uses freeze_min_age instead > > of freeze_table_age. Which probably explains why I had so bad > > performance results with lowering vacuum_freeze_min_age, it basically > > radically increases the amount of full-table-scans, far more than it > > should. > > > > I can't imagine that anybody with a large database ran pg successfully > > with a small freeze_min_age due to this. > > As far as I can tell this bug kicks in when your cluster gets to be > older than freeze_min_age, and then lasts forever after. After that > point pretty much every auto-vacuum inspired by update/deletion > activity will get promoted to a full table scan. (Which makes me > wonder how much field-testing the vm-only vacuum has received, if it > was rarely happening in practice due to this bug.) I think you're misreading the code. freezeTableLimit is calculated by > > limit = ReadNewTransactionId() - freezetable; which is always relative to the current xid. The bug was that freezetable had the wrong value in autovac due to freeze_min_age being used instead of freeze_table_age. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 1, 2013 at 4:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 31, 2013 at 3:18 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> My intention was to apply a Nasby correction to Browne Strength and call >> the resulting function Browne' (Browne prime). Does that sound better? > > /me rests head in hands. I'm not halfway clever enough to hang with > this crowd; I'm not even going to touch the puns in Chris' reply. It's Friday... Fun needs to be had :-). >> Now seriously, I did experiment a bit with this and it seems to behave >> reasonably. Of course, there might be problems with it, and I don't >> oppose to changing the name. "Vacuum strength" didn't sound so great, >> so I picked the first term that came to mind. It's not like picking >> people's last names to name stuff is a completely new idea; that said, >> it was sort of a joke. > > I don't think I really understand the origin of the formula, so > perhaps if someone would try to characterize why it seems to behave > reasonably that would be helpful (at least to me). > >> f(deadtuples, relpages, age) = >> deadtuples/relpages + e ^ (age*ln(relpages)/2^32) > > To maybe make that discussion go more quickly let me kvetch about a > few things to kick things off: > > - Using deadtuples/relpages as part of the formula means that tables > with smaller tuples (thus more tuples per page) will tend to get > vacuumed before tables with larger tuples (thus less tuples per page). > I can't immediately see why that's a good thing. That wasn't intentional, and may be somewhat unfortunate. I picked values that I knew could be easily grabbed, and we don't have an immediate tuples-per-page estimate on pg_class. An estimate should be available in pg_statistic; I'm not sure that the bias from this hurts things badly. > - It's probably important to have a formula where we can be sure that > the wrap-around term will eventually dominate the dead-tuple term, > with enough time to spare to make sure nothing really bad happens; on > the other hand, it's also desirable to avoid the case where a table > that has just crossed the threshold for wraparound vacuuming doesn't > immediately shoot to the top of the list even if it isn't truly > urgent. It's unclear to me just from looking at this formula how well > the second term meets those goals. I think the second term *does* provide a way for wraparound to dominate; splitting it apart a bit... Consider...

         age * ln(relpages)
    e^  --------------------
               2^32

The wraparound portion of this involves age/2^32... In the beginning, the numerator will be near zero and the denominator is 2^32, so the exponential term is roughly 1. As age trends towards 2^32, the fraction (ignoring ln(relpages)) trends towards 1, so that the longer we go without vacuuming, the more certain that the fraction indicates a value near 1. That *tends* to give you something looking like e^1, or 2.71828+, ignoring the relpages part. I threw in multiplying by ln(relpages) as a way to step Well Back from rollover; that means that this term will start growing considerably before rollover, and the larger the table, the sooner that growth takes place. There is a problem with the ln(relpages) term; if the table has just 1 page, the ln(relpages) = 0 so the value of the exponential term is *always* 1. Probably should have ln(relpages+CONSTANT) so that we guarantee the numerator is never 0. I'm a bit worried that the exponential term might dominate *too* quickly.
For a table I have handy with 163K tuples, spread across 3357 pages, ln(relpages) = 8.1188, and the range of the "exponential bit" travels as follows:

dotpro0620@localhost-> select generate_series(1,20)*100/20 as percent_wraparound,
    power(2.71828, (65536.0*32768.0*generate_series(1,20)/20.0 * ln(3357))/(65536.0*32768)) as wraparound_term;
 percent_wraparound | wraparound_term
--------------------+------------------
                  5 | 1.50071232210687
                 10 | 2.2521374737234
                 15 | 3.37981045789535
                 20 | 5.07212320054923
                 25 | 7.61179778630838
                 30 | 11.4231187312988
                 35 | 17.1428150369499
                 40 | 25.7264337615498
                 45 | 38.607976149824
                 50 | 57.9394655396491
                 55 | 86.950469871638
                 60 | 130.48764154935
                 65 | 195.824411555774
                 70 | 293.876107391077
                 75 | 441.023495534592
                 80 | 661.849394087408
                 85 | 993.24554108594
                 90 | 1490.57582238538
                 95 | 2236.92550368832
                100 | 3356.98166702019
(20 rows)

At the beginning, the "wraparound" portion is just 1.5, so easily dominated by a table with a lot of dead tuples. As the time to wraparound declines, the term becomes steadily more urgent. There may be constant factors to fiddle with at the edges, but this term definitely heads towards dominance. That's definitely doing what I intended, and after constructing that table, I think I'm *more* confident. Consider that if there are two tables of different sizes, both head towards "maxing out" at a "wraparound_term" value that is directly correlated with the size of each table, which seems mighty right. A bigger table needs to get drawn into play (e.g. - needs to get vacuumed) earlier than a smaller one. > - More generally, it seems to me that we ought to be trying to think > about the units in which these various quantities are measured. Each > term ought to be unit-less. So perhaps the first term ought to divide > dead tuples by total tuples, which has the nice property that the > result is a dimensionless quantity that never exceeds 1.0. Then the > second term can be scaled somehow based on that value. Absolutely a good idea. I'm not sure I agree it ought to wind up unitless; I'd instead expect a common unit, perhaps number of pages, indicating a surrogate for quantity of I/O. That we're both thinking about "what's the unit?" means we're on a compatible trail. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
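(To make the one-page-table concern above concrete: redoing the same calculation with a hypothetical ln(relpages + 1) guard, where the constant 1 is just an example value, gives a term that still grows, topping out near 2 for a single-page table instead of being pinned at 1.)

    select generate_series(1,20)*100/20 as percent_wraparound,
           power(2.71828,
                 (65536.0*32768.0*generate_series(1,20)/20.0 * ln(1 + 1))
                   / (65536.0*32768)) as wraparound_term_one_page_table;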
Christopher Browne <cbbrowne@gmail.com> writes: > I picked values that I knew could be easily grabbed, and we don't > have an immediate tuples-per-page estimate on pg_class. Er, what? reltuples/relpages is exactly that estimate --- in fact, it's only because of historical accident that we don't store a single float field with that ratio, rather than two fields. Both the creation and the usage of those numbers work explicitly with the ratio. regards, tom lane
On Fri, Feb 1, 2013 at 2:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-02-01 14:05:46 -0800, Jeff Janes wrote: >> As far as I can tell this bug kicks in when your cluster gets to be >> older than freeze_min_age, and then lasts forever after. After that >> point pretty much every auto-vacuum inspired by update/deletion >> activity will get promoted to a full table scan. (Which makes me >> wonder how much field-testing the vm-only vacuum has received, if it >> was rarely happening in practice due to this bug.) > > I think you're misreading the code. freezeTableLimit is calculated by >> > limit = ReadNewTransactionId() - freezetable; > which is always relative to the current xid. The bug was that > freezetable had the wrong value in autovac due to freeze_min_age being > used instead of freeze_table_age. Right. Since freeze_min_age was mistakenly being used, the limit would be 50 million in the past (rather than 150 million) under defaults. But since the last full-table vacuum, whenever that was, used freeze_min_age for its intended purpose, that means the 50 million in the past *at the time of that last vacuum* is the highest that relfrozenxid can be. And that is going to be further back than 50 million from right now, so the vacuum will always be promoted to a full scan. I am not entirely sure of my logic above[1], but I'm depending on empirical observation for my conclusion. The attached patch emits a log entry telling if scan_all is being used, and it always is used (under the bug) once the database gets old enough. Or at least, I've never seen it not use scan_all after that point. As an aside, it does seem like log_autovacuum_min_duration=0 should log whether a scan_all was done, and if so what relfrozenxid got set to. But looking at where the log message is generated, I don't know where to retrieve that info. [1] I don't know why it is that a scan_all vacuum with a freeze_min_age of 50m (or a freezeLimit of 50 million ago) will not set relfrozenxid to a higher value than that if it discovers that it can, but it doesn't seem to. Cheers, Jeff
Attachment
On Monday, January 28, 2013, Kevin Grittner wrote:
> IMO, anything which changes an anti-wraparound vacuum of a
> bulk-loaded table from "read the entire table and rewrite nearly
> the complete table with WAL-logging" to rewriting a smaller portion
> of the table with WAL-logging is an improvement. Anyone who has
> run an OLTP load on a database which was loaded from pg_dump output
> or other bulk load processes, has probably experienced the pain
> related to the WAL-logged rewrite of massive quantities of data.
pgbench seems to be the OLTP load par excellence (or perhaps ad nauseam).
What other setup is needed to get it to reproduce this problem? Do we just do a dump/restore in lieu of pgbench -i?
Cheers,
Jeff
On 2013-02-01 15:09:34 -0800, Jeff Janes wrote: > On Fri, Feb 1, 2013 at 2:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-02-01 14:05:46 -0800, Jeff Janes wrote: > > >> As far as I can tell this bug kicks in when your cluster gets to be > >> older than freeze_min_age, and then lasts forever after. After that > >> point pretty much every auto-vacuum inspired by update/deletion > >> activity will get promoted to a full table scan. (Which makes me > >> wonder how much field-testing the vm-only vacuum has received, if it > >> was rarely happening in practice due to this bug.) > > > > I think you're misreading the code. freezeTableLimit is calculated by > > >> > limit = ReadNewTransactionId() - freezetable; > > > which is always relative to the current xid. The bug was that > > freezetable had the wrong value in autovac due to freeze_min_age being > > used instead of freeze_table_age. > > Right. Since freeze_min_age was mistakenly being used, the limit > would be 50 million in the past (rather than 150 million) under > defaults. But since the last full-table vacuum, whenever that was, > used freeze_min_age for its intended purpose, that means the 50 > million in the past *at the time of that last vacuum* is the highest > that relfrozenxid can be. And that is going to be further back than > 50 million from right now, so the vacuum will always be promoted to a > full scan. Oh, wow. You're right. I shouldn't answer emails after sport with cramped fingers on a Friday night... And I should have thought about this scenario, because I essentially already explained it upthread, just with a different set of variables. This is rather scary. How come nobody noticed that this major performance improvement was effectively disabled for that long? I wonder if Kevin's observations about the price of autovac during OLTPish workloads aren't at least partially caused by this. It will cause lots of I/O prematurely because it scans far more than it should, and a VACUUM FREEZE will push it off. > As an aside, it does seem like log_autovacuum_min_duration=0 should > log whether a scan_all was done, and if so what relfrozenxid got set > to. But looking at where the log message is generated, I don't know > where to retrieve that info. Yes, I agree; I had already been thinking about that because it's really hard to get that information right now. It seems easy enough to include it in the ereport() at the bottom of lazy_vacuum_rel; we determine scan_all in that function, so that seems ok? For head I would actually vote for two data points, full_table_scan: yes/no, skipped_percentage..., both are already available, so it seems like it should be an easy thing to do. I'd like to do this for 9.3, agreed? I would even like to add it to the back branches, but I guess I cannot convince people of that... > [1] I don't know why it is that a scan_all vacuum with a > freeze_min_age of 50m (or a freezeLimit of 50 million ago) will not > set relfrozenxid to a higher value than that if it discovers that it > can, but it doesn't seem to. There currently is no code to track what's the oldest observed xid, so a simple implementation limitation. Making that code better might be rather worthwhile if you're loading your table in a batch and don't touch it later anymore... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-01 16:59:52 -0500, Robert Haas wrote: > I don't think I really understand the origin of the formula, so > perhaps if someone would try to characterize why it seems to behave > reasonably that would be helpful (at least to me). > > > f(deadtuples, relpages, age) = > > deadtuples/relpages + e ^ (age*ln(relpages)/2^32) > > To maybe make that discussion go more quickly let me kvetch about a > few things to kick things off: I am not too happy with the formula yet either, but it seems you've started the discussion in the right direction... > - It's probably important to have a formula where we can be sure that > the wrap-around term will eventually dominate the dead-tuple term, > with enough time to spare to make sure nothing really bad happens; on > the other hand, it's also desirable to avoid the case where a table > that has just crossed the threshold for wraparound vacuuming doesn't > immediately shoot to the top of the list even if it isn't truly > urgent. It's unclear to me just from looking at this formula how well > the second term meets those goals. I just wanted to mention that if everything goes well, we won't *ever* get to an anti-wraparound-vacuum. Normally the table should cross the vacuum_freeze_table_age barrier earlier and promote a normal vacuum to a full-table vacuum which will set relfrozenxid to a newer value and thus prevent anti-wraparound vacuums from occurring. So priorizing anti-wraparound vacuums immediately and heavily doesn't seem to be too bad. > - More generally, it seems to me that we ought to be trying to think > about the units in which these various quantities are measured. Each > term ought to be unit-less. So perhaps the first term ought to divide > dead tuples by total tuples, which has the nice property that the > result is a dimensionless quantity that never exceeds 1.0. Then the > second term can be scaled somehow based on that value. I think we also need to be careful to not try to get too elaborate on this end. Once the general code for priorization is in, the exact priorization formula can be easily incrementally tweaked. Just about any half-way sensible priorization is better than what we have right now and we might discover new effects once we do marginally better. Imo the browne_strength field should be called 'priority' and the priorization calculation formula should be moved into an extra function. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 1, 2013 at 6:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > As an aside, it does seem like log_autovacuum_min_duration=0 should > log whether a scan_all was done, and if so what relfrozenxid got set > to. That would be nifty. > [1] I don't know why it is that a scan_all vacuum with a > freeze_min_age of 50m (or a freezeLimit of 50 million ago) will not > set relfrozenxid to a higher value than that if it discovers that it > can, but it doesn't seem to. That also seems very much worth fixing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Feb 2, 2013 at 8:41 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> - It's probably important to have a formula where we can be sure that >> the wrap-around term will eventually dominate the dead-tuple term, >> with enough time to spare to make sure nothing really bad happens; on >> the other hand, it's also desirable to avoid the case where a table >> that has just crossed the threshold for wraparound vacuuming doesn't >> immediately shoot to the top of the list even if it isn't truly >> urgent. It's unclear to me just from looking at this formula how well >> the second term meets those goals. > > I just wanted to mention that if everything goes well, we won't *ever* > get to an anti-wraparound-vacuum. Normally the table should cross the > vacuum_table_age barrier earlier and promote a normal vacuum to a > full-table vacuum which will set relfrozenxid to a new and lower value > and thus prevent anti-wraparound vacuums from occurring. > So prioritizing anti-wraparound vacuums immediately and heavily doesn't > seem to be too bad.
IMHO, this is hopelessly optimistic. Yes, it's intended to work that way. But INSERT-only or INSERT-mostly tables are far from an uncommon use case; and in fact they're probably the most common cause of pain in this area. You insert a gajillion tuples, and vacuum never kicks off, and then eventually you either update some tuples or hit autovacuum_freeze_max_age and suddenly, BAM, you get this gigantic vacuum that rewrites the entire table. And then you open a support ticket with your preferred PostgreSQL support provider and say something like "WTF?".
>> - More generally, it seems to me that we ought to be trying to think >> about the units in which these various quantities are measured. Each >> term ought to be unit-less. So perhaps the first term ought to divide >> dead tuples by total tuples, which has the nice property that the >> result is a dimensionless quantity that never exceeds 1.0. Then the >> second term can be scaled somehow based on that value. > > I think we also need to be careful to not try to get too elaborate on > this end. Once the general code for prioritization is in, the exact > prioritization formula can be easily incrementally tweaked. Just about any > half-way sensible prioritization is better than what we have right now and > we might discover new effects once we do marginally better.
I agree. It would be nice to have some way of measuring the positive or negative impact of what we introduce, too, but I don't have a good idea what that would be.
> Imo the browne_strength field should be called 'priority' and the > prioritization formula should be moved into a separate function.
Yeah, or maybe vacuum_priority, since that would be easier to grep for.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-02-02 11:25:01 -0500, Robert Haas wrote: > On Sat, Feb 2, 2013 at 8:41 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> - It's probably important to have a formula where we can be sure that > >> the wrap-around term will eventually dominate the dead-tuple term, > >> with enough time to spare to make sure nothing really bad happens; on > >> the other hand, it's also desirable to avoid the case where a table > >> that has just crossed the threshold for wraparound vacuuming doesn't > >> immediately shoot to the top of the list even if it isn't truly > >> urgent. It's unclear to me just from looking at this formula how well > >> the second term meets those goals. > > > > I just wanted to mention that if everything goes well, we won't *ever* > > get to an anti-wraparound-vacuum. Normally the table should cross the > > vacuum_table_age barrier earlier and promote a normal vacuum to a > > full-table vacuum which will set relfrozenxid to a new and lower value > > and thus prevent anti-wraparound vacuums from occurring. > > So prioritizing anti-wraparound vacuums immediately and heavily doesn't > > seem to be too bad. > > IMHO, this is hopelessly optimistic. Yes, it's intended to work that > way. But INSERT-only or INSERT-mostly tables are far from an uncommon > use case; and in fact they're probably the most common cause of pain > in this area. You insert a gajillion tuples, and vacuum never kicks > off, and then eventually you either update some tuples or hit > autovacuum_freeze_max_age and suddenly, BAM, you get this gigantic > vacuum that rewrites the entire table. And then you open a support > ticket with your preferred PostgreSQL support provider and say > something like "WTF?".
You're right, this doesn't work superbly well, especially for insert-only tables... But imo the place to fix it is not the prioritization logic but relation_needs_vacanalyze, since fixing it in prioritization won't prevent the BAM, only change its timing.
I think scheduling a table for a partial vacuum every min_freeze * 2 xids, even if it's insert-only, would go a long way toward reducing the impact of full-table vacuums. Obviously that would require retaining the last xid a vacuum was executed in...
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 2, 2013 at 1:49 PM, Andres Freund <andres@2ndquadrant.com> wrote: > You're right, this doesn't work superbly well, especially for > insert-only tables... But imo the place to fix it is not the > prioritization logic but relation_needs_vacanalyze, since fixing it in > prioritization won't prevent the BAM, only change its timing.
Agreed.
> I think scheduling a table for a partial vacuum every min_freeze * 2 > xids, even if it's insert-only, would go a long way toward reducing the > impact of full-table vacuums. Obviously that would require retaining the > last xid a vacuum was executed in...
I'm not sure that min_freeze * 2 is the right value, but otherwise agreed.
I keep coming back to the idea that vacuum should have a high-priority queue and a low-priority queue. When stuff meets the current thresholds, it goes into the high-priority queue. But then there should be a low-priority queue where we do partial vacuums of things that meet some lower threshold - like the unfrozen portions of insert-only tables.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-02-02 14:54:10 -0500, Robert Haas wrote: > On Sat, Feb 2, 2013 at 1:49 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > I think scheduling a table for a partial vacuum every min_freeze * 2 > > xids, even if it's insert-only, would go a long way toward reducing the > > impact of full-table vacuums. Obviously that would require retaining the > > last xid a vacuum was executed in... > > I'm not sure that min_freeze * 2 is the right value, but otherwise agreed.
Yes, min_freeze * 2 was purely a very first quick guess. My reasoning was basically that values smaller than that won't really do all that much work, since barely any of the tuples are old enough, while values much bigger won't help much in reducing the huge amounts of writes being submitted at the same time and might miss the window before the next anti-wraparound vacuum.
I guess something like min_freeze * 5 would theoretically be better, because it would reduce the number of tuples that are too young for freezing which will be marked all-visible nonetheless and thus not scanned again. But it would not work with the current relevant default settings:
vacuum_freeze_table_age = 150,000,000
vacuum_freeze_min_age = 50,000,000
autovacuum_freeze_max_age = 200,000,000
I guess we will have to think about the defaults for those values. Ad hoc I am thinking of something like:
vacuum_freeze_table_age = 300,000,000
vacuum_freeze_min_age = 20,000,000
autovacuum_freeze_max_age = 800,000,000
and scheduling a vacuum independently of n_dead_tuples every (freeze_max_age - freeze_table_age) / 5 or so xids. That would mean that approx 4/5 (or more on a busy system) of the tuples would get frozen before a full-table vacuum. I don't think the disk size argument for freeze_max_age = 200,000,000 is particularly relevant anymore, especially as we would normally keep the size at vacuum_freeze_table_age.
To finally fix the issue ISTM that we need an 'age map' to know which parts to scan again and which parts never need to be scanned again, but that's a separate, not too small, feature.
> I keep coming back to the idea that vacuum should have a high-priority > queue and a low-priority queue. When stuff meets the current > thresholds, it goes into the high-priority queue. But then there > should be a low-priority queue where we do partial vacuums of things > that meet some lower threshold - like the unfrozen portions of > insert-only tables.
I don't think that's the most crucial part atm. Such a queue doesn't solve the problem that we don't want to do unnecessary, repetitive work.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
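A rough illustration of the scheduling rule sketched in that message; the names and the per-table bookkeeping are hypothetical (nothing like last_vacuum_xid exists today):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical sketch: even for insert-only tables with no dead tuples,
 * request a partial vacuum once enough xids have passed since the last
 * vacuum, so freezing happens in small increments instead of one huge
 * full-table scan at autovacuum_freeze_max_age.
 */
static bool
needs_freeze_nibble(TransactionId current_xid,
                    TransactionId last_vacuum_xid,
                    uint32_t freeze_table_age,   /* e.g. 300 million */
                    uint32_t freeze_max_age)     /* e.g. 800 million */
{
    /* (800M - 300M) / 5 = 100M xids between partial passes */
    uint32_t interval = (freeze_max_age - freeze_table_age) / 5;

    /* wraparound-safe xid comparison is glossed over here */
    return (current_xid - last_vacuum_xid) >= interval;
}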
On 2013-02-01 15:09:34 -0800, Jeff Janes wrote: > As an aside, it does seem like log_autovacuum_min_duration=0 should > log whether a scan_all was done, and if so what relfrozenxid got set > to. But looking at where the log message is generated, I don't know > where to retrieve that info.
What about the following, very rough and quick, patch:
LOG: automatic vacuum of table "postgres.public.data_1": index scans: 1
    pages: 2703 removed, 2702 remain, 5405 (100.00%) scanned
    tuples: 499999 removed, 500001 remain
    full-scan: 1, freeze-limit: 28824, new-frozen-xid: 28824
    buffer usage: 29957 hits, 2 misses, 4 dirtied
    avg read rate: 0.020 MB/s, avg write rate: 0.040 MB/s
    system usage: CPU 0.01s/0.67u sec elapsed 0.77 sec
...
LOG: automatic vacuum of table "postgres.public.data_1": index scans: 1
    pages: 2703 removed, 5404 remain, 5411 (66.74%) scanned
    tuples: 499999 removed, 1000071 remain
    full-scan: 0, freeze-limit: 28828, new-frozen-xid: -
    buffer usage: 34085 hits, 3 misses, 4 dirtied
    avg read rate: 0.027 MB/s, avg write rate: 0.036 MB/s
    system usage: CPU 0.01s/0.73u sec elapsed 0.86 sec
It obviously needs more polish:
- I opted for using the 64bit representation of xids, seems to be better in a log which very well might be looked at only after some wraparounds
- exporting 'txid' from adt/txid.c is pretty ugly. I don't like the invention of the type in general, but making it visible outside of txid.c is even uglier, but using both, plain uint64 and txid inside txid.c isn't nice either.
- txid_from_xid should be renamed, don't have a good idea to what right now.
- is there agreement on the additionally logged information?
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-03 02:40:04 +0100, Andres Freund wrote: > On 2013-02-01 15:09:34 -0800, Jeff Janes wrote: > > As an aside, it does seem like log_autovacuum_min_duration=0 should > > log whether a scan_all was done, and if so what relfrozenxid got set > > to. But looking at where the log message is generated, I don't know > > where to retrieve that info. > > What about the following, very rough and quick, patch: -EINTR -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
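A minimal sketch of where the fields proposed in that rough patch would come from, using simplified names (the real change would extend the existing ereport() at the end of lazy_vacuum_rel rather than a standalone helper):

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Illustrative helper: format the proposed "full-scan / freeze-limit /
 * new-frozen-xid" line.  lazy_vacuum_rel already knows whether it ran a
 * full-table scan and which freeze limit it used; relfrozenxid is only
 * advanced (and hence worth printing) in the full-scan case.
 */
static void
format_freeze_info(char *buf, size_t buflen,
                   bool scanned_all,
                   TransactionId freeze_limit,
                   TransactionId new_frozen_xid)
{
    if (scanned_all)
        snprintf(buf, buflen,
                 "full-scan: 1, freeze-limit: %u, new-frozen-xid: %u",
                 freeze_limit, new_frozen_xid);
    else
        snprintf(buf, buflen,
                 "full-scan: 0, freeze-limit: %u, new-frozen-xid: -",
                 freeze_limit);
}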
On Sat, Feb 2, 2013 at 2:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Feb 2, 2013 at 1:49 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> You're right, this doesn't work superbly well, especially for >> insert-only tables... But imo the place to fix it is not the >> prioritization logic but relation_needs_vacanalyze, since fixing it in >> prioritization won't prevent the BAM, only change its timing. > > Agreed. > >> I think scheduling a table for a partial vacuum every min_freeze * 2 >> xids, even if it's insert-only, would go a long way toward reducing the >> impact of full-table vacuums. Obviously that would require retaining the >> last xid a vacuum was executed in... > > I'm not sure that min_freeze * 2 is the right value, but otherwise agreed. > > I keep coming back to the idea that vacuum should have a high-priority > queue and a low-priority queue. When stuff meets the current > thresholds, it goes into the high-priority queue. But then there > should be a low-priority queue where we do partial vacuums of things > that meet some lower threshold - like the unfrozen portions of > insert-only tables.
When I was thinking about your desire for "unitless" values, I found myself uncomfortable about that, and I think I've mentioned that. On further reflection, there's good reason. The need to vacuum tables with lots of dead tuples has very different characteristics from the need to vacuum tables to avoid XID rollover. Trying to force them onto the same units seems unlikely to turn out happily.
On the other hand, I always thought that there was use for having multiple autovacuum queues, and giving queues differently shaped policies, one for each purpose, seems like a mighty fine idea. That way we don't need to worry about mixing the policies. There can be two "best policies."
I'd go further, and have 3 queues:
a) A queue devoted to vacuuming small tables. Anything with more than [some number of relpages] need not apply.
b) A queue devoted to vacuuming tables with a lot of dead tuples.
c) A queue devoted to vacuuming tables before their XID rollover.
The appropriate "strength" functions for b) and c) can be pretty simple, possibly the relevant bits of the functions that Nasby and I have suggested. And any time b) and c) find small tables, throw them to queue a), essentially doing the "quick & easy" vacuums.
-- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
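A rough sketch of how such a three-way split could route tables; all names and thresholds here are invented for illustration, and nothing like this exists in autovacuum today:

/* Hypothetical queue classification, purely illustrative. */
typedef enum VacQueue
{
    QUEUE_SMALL_TABLES,   /* a) quick & easy vacuums */
    QUEUE_DEAD_TUPLES,    /* b) bloat-driven vacuums */
    QUEUE_WRAPAROUND      /* c) freeze-driven vacuums */
} VacQueue;

/*
 * Small tables always go to the cheap queue; otherwise route by whichever
 * pressure (xid age vs. dead tuples) is the reason the table needs work.
 */
static VacQueue
classify_table(double relpages, double xid_age,
               double small_table_pages, double freeze_table_age)
{
    if (relpages <= small_table_pages)
        return QUEUE_SMALL_TABLES;
    if (xid_age >= freeze_table_age)
        return QUEUE_WRAPAROUND;
    return QUEUE_DEAD_TUPLES;
}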
On 03/02/13 15:08, Christopher Browne wrote:
> On Sat, Feb 2, 2013 at 2:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Feb 2, 2013 at 1:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> You're right, this doesn't work superbly well, especially for insert-only tables... But imo the place to fix it is not the prioritization logic but relation_needs_vacanalyze, since fixing it in prioritization won't prevent the BAM, only change its timing.
>> Agreed.
>>> I think scheduling a table for a partial vacuum every min_freeze * 2 xids, even if it's insert-only, would go a long way toward reducing the impact of full-table vacuums. Obviously that would require retaining the last xid a vacuum was executed in...
>> I'm not sure that min_freeze * 2 is the right value, but otherwise agreed.
>> I keep coming back to the idea that vacuum should have a high-priority queue and a low-priority queue. When stuff meets the current thresholds, it goes into the high-priority queue. But then there should be a low-priority queue where we do partial vacuums of things that meet some lower threshold - like the unfrozen portions of insert-only tables.
> When I was thinking about your desire for "unitless" values, I found myself uncomfortable about that, and I think I've mentioned that. On further reflection, there's good reason. The need to vacuum tables with lots of dead tuples has very different characteristics from the need to vacuum tables to avoid XID rollover. Trying to force them onto the same units seems unlikely to turn out happily.
> On the other hand, I always thought that there was use for having multiple autovacuum queues, and giving queues differently shaped policies, one for each purpose, seems like a mighty fine idea. That way we don't need to worry about mixing the policies. There can be two "best policies."
> I'd go further, and have 3 queues:
> a) A queue devoted to vacuuming small tables. Anything with more than [some number of relpages] need not apply.
> b) A queue devoted to vacuuming tables with a lot of dead tuples.
> c) A queue devoted to vacuuming tables before their XID rollover.
> The appropriate "strength" functions for b) and c) can be pretty simple, possibly the relevant bits of the functions that Nasby and I have suggested. And any time b) and c) find small tables, throw them to queue a), essentially doing the "quick & easy" vacuums.
Hmm...
Could there be some measure of bloatedness? A table with 10 live rows and 100 dead tuples should surely have a higher priority of being vacuumed than a table with 1000 rows and 100 dead tuples? Especially for tables with hundreds of millions of rows!
Cheers,
Gavin
Andres Freund <andres@2ndquadrant.com> writes: > It obviously needs more polish: > - I opted for using the 64bit representation of xids, seems to be better > in a log which very well might be looked at only after some > wraparounds > - exporting 'txid' from adt/txid.c is pretty ugly. I don't like the > invention of the type in general, but making it visible outside of > txid.c is even uglier, but using both, plain uint64 and txid inside > txid.c isn't nice either. > - txid_from_xid should be renamed, don't have a good idea to what right > now. > - is there agreement on the additionally logged information? -1 on using txids here. If memory serves, we have had exactly this discussion before and rejected spreading those into other parts of the system. That gets rid of three of your problems right there, as well as a lot of ugly hackery with UINT64_FORMAT. regards, tom lane
On 2013-02-03 11:17:42 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > It obviously needs more polish: > > > - I opted for using the 64bit representation of xids, seems to be better > > in a log which very well might be looked at only after some > > wraparounds > > - exporting 'txid' from adt/txid.c is pretty ugly. I don't like the > > invention of the type in general, but making it visible outside of > > txid.c is even uglier, but using both, plain uint64 and txid inside > > txid.c isn't nice either. > > - txid_from_xid should be renamed, don't have a good idea to what right > > now. > > - is there agreement on the additionally logged information? > > -1 on using txids here. If memory serves, we have had exactly this > discussion before and rejected spreading those into other parts > of the system. That gets rid of three of your problems right there, > as well as a lot of ugly hackery with UINT64_FORMAT. What about providing something like char *TransactionIdToEpochStrP() and implementing it in txid.c instead of transam.c? Not pretty but it wouldn't expose much to the outside? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-02-01 15:09:34 -0800, Jeff Janes wrote: >> Since freeze_min_age was mistakenly being used, the limit >> would be 50 million in the past (rather than 150 million) under >> defaults. But since the last full-table vacuum, whenever that was, >> used freeze_min_age for its intended purpose, that means the 50 >> million in the past *at the time of that last vacuum* is the highest >> that relfrozenxid can be. And that is going to be further back than >> 50 million from right now, so the vacuum will always be promoted to a >> full scan. > This is rather scary. How come nobody noticed that this major > performance improvement was effectively disabled for that long? > > I wonder if Kevin's observations about the price of autovac during > OLTPish workloads aren't at least partially caused by this. It will cause > lots of io prematurely because it scans far more than it should and a > VACUUM FREEZE will push it off.
Well, there seem to be multiple problems with autovacuum which are serious enough that people are noticing severe hits to production performance, figuring out that it is autovacuum, turning off autovacuum and getting immediate relief, and then calling for support a few months later when the inevitable consequences hit. At that point, of course, there is a lot of pain, and it is hard to recover from it without a sizable maintenance window.
I was able to confirm two cases where this was a consequence of the lazy truncate logic which Jan recently fixed, but there are clearly other problems which I didn't have much of a grasp on prior to this thread. The only thing I knew for sure was that a bulk load, and in particular a pg_dump/restore cycle, was a time bomb without VACUUM FREEZE, and users are often reluctant to add time for that to their maintenance window. If we can't load tuples as frozen in the first place, the next best thing would be to nibble away at freezing them in the background, so that there is no single big hit. While proposals here seem to deal with very real problems which we should fix, I'm not sure that anything here addresses this issue.
COPY FREEZE does address this, where it can be used; but I'm not sure that we don't need to address it in the autovacuum end of things, too. These problems are serious enough to merit cautious back-patching, in my view; the current state of affairs really is causing serious disruption of production environments.
-Kevin
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-02-03 11:17:42 -0500, Tom Lane wrote: >> -1 on using txids here. If memory serves, we have had exactly this >> discussion before and rejected spreading those into other parts >> of the system. That gets rid of three of your problems right there, >> as well as a lot of ugly hackery with UINT64_FORMAT. > What about providing something like char *TransactionIdToEpochStrP() and > implementing it in txid.c instead of transam.c? Not pretty but it > wouldn't expose much to the outside? I'm objecting to the entire concept, not just how much cruft gets exposed outside txid.c. I went looking for the previous discussion and couldn't find it, but here are some significant reasons not to use txids for logging: * They don't have anything to do with the xids you can actually see in the database (at least not without mod-2^32 arithmetic that is hard to do in one's head). * txid_from_xid is very expensive because of the GetNextXidAndEpoch call; and if it got to be commonly used it would significantly increase contention for the XidGenLock lock. (Admittedly, two such calls per VACUUM probably don't mean anything. But once we establish a precedent of logging txids not xids, there's a slippery slope down to where it will be a problem.) * We've found bugs repeatedly in the txid epoch conversion code, and I have little confidence that there aren't more. (The fact that your patch found it necessary to touch convert_xid() isn't exactly improving my opinion of this code, either.) Accordingly, I think that log entries involving txids would be materially less reliable than if we just print the xids and have done. regards, tom lane
On 2013-02-03 13:26:25 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2013-02-03 11:17:42 -0500, Tom Lane wrote: > >> -1 on using txids here. If memory serves, we have had exactly this > >> discussion before and rejected spreading those into other parts > >> of the system. That gets rid of three of your problems right there, > >> as well as a lot of ugly hackery with UINT64_FORMAT. > > > What about providing something like char *TransactionIdToEpochStrP() and > > implementing it in txid.c instead of transam.c? Not pretty but it > > wouldn't expose much to the outside? > > I'm objecting to the entire concept, not just how much cruft gets > exposed outside txid.c.
Ok, I can live with that. The reason I wanted to log txids instead of plain xids is exactly that you're basically required to do the mod-2^32 arithmetic to understand the numbers; the xids frequently are wrapped around. E.g. the freeze-xid in a somewhat new cluster will be something like 4094966496, which isn't that easy to interpret. Even more so with relfrozenxids et al. I personally find it relatively hard to compare an xid like 4094966496, even more so when comparing a wrapped-around value with one that isn't.
> The fact that your patch found it necessary to touch convert_xid() > isn't exactly improving my opinion of this code, either.
That's just because it couldn't handle xids that are essentially in epoch -1. E.g. the freeze limit in a new cluster will wrap around into that (~800 - 200 million).
Any other opinions on the content of what's being logged?
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
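For anyone not steeped in the xid arithmetic being argued about, a simplified illustration of the epoch-extension idea follows; it mirrors the intent of txid.c's conversion, not its actual code, and shows how a freeze limit in a young cluster lands in "epoch -1":

#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Extend a 32-bit xid to 64 bits relative to the current (epoch, xid)
 * pair.  An xid numerically ahead of the current one must have wrapped
 * around, i.e. it belongs to the previous epoch; in a brand-new cluster
 * that previous epoch is -1, which is the case Andres says the original
 * conversion code could not handle.
 */
static int64_t
xid_to_xid64(TransactionId xid, TransactionId now_xid, uint32_t now_epoch)
{
    int64_t epoch = now_epoch;

    if (xid > now_xid)
        epoch--;                 /* may become -1 in a young cluster */

    return epoch * INT64_C(4294967296) + (int64_t) xid;
}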
On Sun, Feb 3, 2013 at 9:25 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > > I was able to confirm two cases where this was a consequence of the > lazy truncate logic which Jan recently fixed, but there are clearly > other problems which I didn't have much of a grasp on prior to this > thread. The only thing I knew for sure was that a bulk load, and > in particular a pg_dump/restore cycle, was a time bomb without > VACUUM FREEZE, and users are often reluctant to add time for that > to their maintenance window. If we can't load tuples as frozen in > the first place, the next best thing would be to nibble away at > freezing them in the background, so that there is no single big > hit. While proposals here seem to deal with very real problems > which we should fix, I'm not sure that anything here addresses this > issue.
It seems like we already have all the machinery we need for this. Don't run VACUUM FREEZE during the maintenance window; instead, at the end of the maintenance window, set vacuum_cost_delay to nibble (whatever integer that translates to on your system) and let 'er rip. Since vacuum_cost_delay is session-settable, it wouldn't interfere with the setting for other things. The only real problem I see with this is how to restart it so that it picks up where it left off if it should get interrupted.
I guess giving vacuumdb new options to set the vacuum_cost settings it runs under would be a handy feature for doing this.
Cheers, Jeff
On Sat, Feb 2, 2013 at 5:25 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-02-01 15:09:34 -0800, Jeff Janes wrote: >> On Fri, Feb 1, 2013 at 2:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > On 2013-02-01 14:05:46 -0800, Jeff Janes wrote: >> >> >> As far as I can tell this bug kicks in when your cluster gets to be >> >> older than freeze_min_age, and then lasts forever after. After that >> >> point pretty much every auto-vacuum inspired by update/deletion >> >> activity will get promoted to a full table scan. (Which makes me >> >> wonder how much field-testing the vm-only vacuum has received, if it >> >> was rarely happening in practice due to this bug.) >> > >> > I think you're misreading the code. freezeTableLimit is calculated by >> >> >> > limit = ReadNewTransactionId() - freezetable; >> >> > which is always relative to the current xid. The bug was that >> > freezetable had the wrong value in autovac due to freeze_min_age being >> > used instead of freeze_table_age. >> >> Right. Since freeze_min_age was mistakenly being used, the limit >> would be 50 million in the past (rather than 150 million) under >> defaults. But since the last full-table vacuum, whenever that was, >> used freeze_min_age for its intended purpose, that means the 50 >> million in the past *at the time of that last vacuum* is the highest >> that relfrozenxid can be. And that is going to be further back than >> 50 million from right now, so the vacuum will always be promoted to a >> full scan. > > Oh, wow. You're right. I shouldn't answer emails after sport with cramped > fingers on a friday night... And I should have thought about this > scenario, because I essentially already explained it upthread, just with > a different set of variables. > > This is rather scary. How come nobody noticed that this major > performance improvement was effectively disabled for that long?
I'm not sure whom to address this to, but the just-committed release notes for this issue reflect the original understanding that it only applied when vacuum_freeze_min_age was lowered from its default, rather than the current understanding that it affects all old-enough systems. If the release notes are not already baked in, I would suggest this wording:
+ The main consequence of this mistake is that it
+ caused full-table vacuuming scans to occur much more frequently
+ than intended.
Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > On Sat, Feb 2, 2013 at 5:25 AM, Andres Freund <andres@2ndquadrant.com> wrote: > If the release notes are not already baked in, I would suggest this wording: > + The main consequence of this mistake is that it > + caused full-table vacuuming scans to occur much more frequently > + than intended. It's baked ... sorry about that, but when making the notes there's seldom time to go through the threads about every patch to see if the commit messages are accurate or not. regards, tom lane
On 1/31/13 2:18 PM, Alvaro Herrera wrote: > My intention was to apply a Nasby correction to Browne Strength and call > the resulting function Browne' (Browne prime). Does that sound better? I suggest painting that bikeshed "Browneby". :P
Here's another cut at this patch. This is mainly about the infrastructure to pass the data around in autovacuum; the proposed formulas probably need a lot of work.
We still have two terms in autovacuum priority: the first one considers dead tuples and the second one considers the wraparound limit. I have kept Chris' proposal for the second term, but refined the first one a bit per Jim Nasby's suggestion of discounting dead space. So we now have
return (d->p_deadtuples * (d->p_livetuples + d->p_deadtuples) * d->p_relpages) /
       ((d->p_livetuples + 1) * d->p_reltuples * nblocks) +
       exp(d->p_xidage * logf(nblocks) / UINT_MAX);
Here, deadtuples and livetuples come from pgstat data, while relpages and reltuples come from pg_class. nblocks, on the other hand, comes from the actual number of blocks in the table.
I haven't considered the case where pg_class.reltuples = 0 (which results in division-by-zero), but I think to be really robust here we'd want to have some code copied from estimate_rel_size; or maybe simply use some hardcoded magic value. I lean towards the latter, because I'm not sure we want to expend a relation open at this point (incurring an attempt to lock the table, which could be problematic); hence the new RelFileNodeGetNumberOfBlocks() thingy, which is admittedly pretty ugly, not to mention untested. (I have considered livetuples=0 however, hence the +1 there.)
I think we now need to have a more focused discussion on useful formulas to use here. One thing I noticed that fails in the above formula is that as nblocks grows, ceteris paribus, the score falls; but that's wrong, because if you have a table that turns out to have much larger nblocks because it bloated and pgstat lost the message, we need to look harder at it. So somehow we need to consider the tuple density as given by pg_class.reltuples/pg_class.relpages, compare it with the one given by pgstat.(live+dead) / nblocks, and raise the score as the ratio goes down (in normal conditions the ratio should be 1; a bloated table that pgstat hasn't noticed will have a lower ratio).
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
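A sketch of how those two refinements might look if bolted onto the posted formula; the struct, names, and guards are illustrative only, and the real patch wires the inputs through autovacuum quite differently:

#include <math.h>
#include <limits.h>

/* Hypothetical bundle of the inputs named in the message above. */
typedef struct PriorityInput
{
    double deadtuples;   /* pgstat */
    double livetuples;   /* pgstat */
    double relpages;     /* pg_class.relpages */
    double reltuples;    /* pg_class.reltuples */
    double nblocks;      /* actual current number of blocks */
    double xidage;       /* age of relfrozenxid */
} PriorityInput;

/*
 * The posted formula, plus (a) a guard against reltuples = 0 and (b) the
 * density-ratio correction suggested at the end of the message: when the
 * pgstat-derived density is lower than the pg_class-derived one, the table
 * has probably bloated without pgstat noticing, so the dead-tuple term is
 * scaled up accordingly.
 */
static double
autovac_priority(const PriorityInput *d)
{
    double reltuples = (d->reltuples > 0) ? d->reltuples : 1.0;
    double relpages  = (d->relpages > 0) ? d->relpages : 1.0;
    double nblocks   = (d->nblocks > 0) ? d->nblocks : 1.0;

    double dead_term =
        (d->deadtuples * (d->livetuples + d->deadtuples) * relpages) /
        ((d->livetuples + 1) * reltuples * nblocks);
    double wrap_term = exp(d->xidage * log(nblocks) / (double) UINT_MAX);

    /* density ratio: pgstat tuples per block vs. pg_class tuples per block */
    double pgstat_density  = (d->livetuples + d->deadtuples) / nblocks;
    double pgclass_density = reltuples / relpages;
    double ratio = (pgclass_density > 0) ? pgstat_density / pgclass_density : 1.0;

    /* raise the score as the ratio drops below 1 (bloat pgstat hasn't noticed) */
    if (ratio > 0 && ratio < 1.0)
        dead_term /= ratio;

    return dead_term + wrap_term;
}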