Thread: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Florian Helmberger <fh@25th-floor.com> writes:
> On 25.05.11 04:47, Tom Lane wrote:
>> Florian Helmberger <fh@25th-floor.com> writes:
>>> I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
>>> Debian 5.0.4 and have an issue with a TOAST table and far too frequent
>>> autovacuum runs.
>>>
>>> I think I've pinned the problem down to the values pg_class holds for
>>> the affected TOAST table:
>>>
>>> relpages  | 433596
>>> reltuples | 1868538
>>>
>>> These values are significantly too low. Interestingly, the autovacuum
>>> log output reports the correct values:
>>>
>>> pages: 0 removed, 34788136 remain
>>> tuples: 932487 removed, 69599038 remain
>>>
>>> but these aren't stored in pg_class after each run.

>> That's exceedingly weird. Do the pg_stat_all_tables columns update
>> after autovacuums on that table?

> Yes they do:

I think I see what must be going on here: that toast table must contain
a long run of all-visible-according-to-the-VM pages (which is hardly a
surprising situation). This results in VACUUM choosing not to update
the pg_class entry:

    /*
     * Update statistics in pg_class.  But only if we didn't skip any pages;
     * the tuple count only includes tuples from the pages we've visited, and
     * we haven't frozen tuples in unvisited pages either.  The page count is
     * accurate in any case, but because we use the reltuples / relpages ratio
     * in the planner, it's better to not update relpages either if we can't
     * update reltuples.
     */
    if (vacrelstats->scanned_all)
        vac_update_relstats(onerel,
                            vacrelstats->rel_pages, vacrelstats->rel_tuples,
                            vacrelstats->hasindex,
                            FreezeLimit);

For an ordinary table this wouldn't be fatal because we'll still do an
ANALYZE from time to time, and that will update the entries with new
(approximate) values. But we never run ANALYZE on toast tables. And
this would *still* be okay, because as noted in the comment, the planner
only depends on the ratio being roughly correct, not on either
individual value being current.
But autovacuum didn't get the memo; it thinks it can use reltuples to
make decisions. I can see two basic approaches we might take here:

1. Modify autovacuum to use something from the stats collector, rather
than reltuples, to make its decisions. I'm not too clear on why
reltuples is being used there anyway; is there some good algorithmic or
statistical reason why AV should be looking at a number from the last
vacuum?

2. Revise the vacuum code so that it doesn't skip updating the pg_class
entries. We could have it count the number of pages it skipped, rather
than just keeping a bool, and then scale up the rel_tuples count to be
approximately right by assuming the skipped pages have tuple density
similar to the scanned ones.

Thoughts?

			regards, tom lane
Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011:

> I think I see what must be going on here: that toast table must contain
> a long run of all-visible-according-to-the-VM pages (which is hardly a
> surprising situation). This results in VACUUM choosing not to update
> the pg_class entry:
>
>     /*
>      * Update statistics in pg_class.  But only if we didn't skip any pages;
>      * the tuple count only includes tuples from the pages we've visited, and
>      * we haven't frozen tuples in unvisited pages either.  The page count is
>      * accurate in any case, but because we use the reltuples / relpages ratio
>      * in the planner, it's better to not update relpages either if we can't
>      * update reltuples.
>      */
>     if (vacrelstats->scanned_all)
>         vac_update_relstats(onerel,
>                             vacrelstats->rel_pages, vacrelstats->rel_tuples,
>                             vacrelstats->hasindex,
>                             FreezeLimit);
>
> For an ordinary table this wouldn't be fatal because we'll still do an
> ANALYZE from time to time, and that will update the entries with new
> (approximate) values. But we never run ANALYZE on toast tables.

Ouch.

> I can see two basic approaches we might take here:
>
> 1. Modify autovacuum to use something from the stats collector, rather
> than reltuples, to make its decisions. I'm not too clear on why
> reltuples is being used there anyway; is there some good algorithmic or
> statistical reason why AV should be looking at a number from the last
> vacuum?

It uses reltuples simply because it was what the original contrib code
was using. Since pgstat was considerably weaker at the time, reltuples
might have been the only thing available. It's certainly the case that
pgstat has improved a lot since autovacuum got in, and some things have
been revised, but not this one.

> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
> entries. We could have it count the number of pages it skipped, rather
> than just keeping a bool, and then scale up the rel_tuples count to be
> approximately right by assuming the skipped pages have tuple density
> similar to the scanned ones.

Hmm, interesting idea. This would be done only for toast tables, or all
tables?

At this point I only wonder why we store tuples & pages rather than
just live tuple density.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, May 25, 2011 at 11:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
> entries. We could have it count the number of pages it skipped, rather
> than just keeping a bool, and then scale up the rel_tuples count to be
> approximately right by assuming the skipped pages have tuple density
> similar to the scanned ones.

This approach doesn't seem like a good idea to me. The skipped portions
of the table are the ones that haven't been modified in a while, so this
embeds an assumption that the tuples in the hot and cold portions of the
table are the same size. That might be true, but it isn't hard to think
of cases where it might not be. There could also very easily be
sampling error, because it's easy to imagine a situation where 99% of
the table is getting skipped. Any error that creeps into the estimate
is going to get scaled up along with the estimate itself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011:
>> I can see two basic approaches we might take here:
>>
>> 1. Modify autovacuum to use something from the stats collector, rather
>> than reltuples, to make its decisions. I'm not too clear on why
>> reltuples is being used there anyway; is there some good algorithmic or
>> statistical reason why AV should be looking at a number from the last
>> vacuum?

> It uses reltuples simply because it was what the original contrib code
> was using. Since pgstat was considerably weaker at the time, reltuples
> might have been the only thing available. It's certainly the case that
> pgstat has improved a lot since autovacuum got in, and some things have
> been revised but not this one.

On reflection I'm hesitant to do this, especially for a backpatched bug
fix, because it would be changing the feedback loop behavior for
autovacuum scheduling. That could have surprising consequences.

>> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
>> entries. We could have it count the number of pages it skipped, rather
>> than just keeping a bool, and then scale up the rel_tuples count to be
>> approximately right by assuming the skipped pages have tuple density
>> similar to the scanned ones.

> Hmm, interesting idea. This would be done only for toast tables, or all
> tables?

I'm thinking just do it for all. The fact that these numbers don't
necessarily update after a vacuum is certainly surprising in and of
itself, and it did not work that way before the VM patch went in. I'm
concerned about other stuff besides AV not dealing well with obsolete
values.

> At this point I only wonder why we store tuples & pages rather than just
> live tuple density.

It's just for backwards compatibility. I've thought about doing that in
the past, but I don't know what client-side code might be looking at
relpages/reltuples. It's not like collapsing them into one field would
save much, anyway.

			regards, tom lane
From: "Kevin Grittner"
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I don't know what client-side code might be looking at
> relpages/reltuples.

I know that I find reltuples useful for getting an "accurate enough"
sense of rows in a table (or set of tables) without resorting to
count(*). I'd be OK with any two of pages, tuples, and density; but
dropping to one would make databases harder to manage, IMV.

Personally, I don't have much code that uses those columns; eliminating
an existing column wouldn't involve much pain for me as long as it
could still be derived.

-Kevin
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 25, 2011 at 11:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
>> entries. We could have it count the number of pages it skipped, rather
>> than just keeping a bool, and then scale up the rel_tuples count to be
>> approximately right by assuming the skipped pages have tuple density
>> similar to the scanned ones.

> This approach doesn't seem like a good idea to me. The skipped
> portions of the table are the ones that haven't been modified in a
> while, so this embeds an assumption that the tuples in the hot and
> cold portions of the table are the same size. That might be true,
> but it isn't hard to think of cases where it might not be. There
> could also very easily be sampling error, because it's easy to imagine
> a situation where 99% of the table is getting skipped.

Yeah, I had been thinking about the latter point. We could be
conservative and just keep the reported tuple density the same (ie,
update relpages to the new correct value, while setting reltuples so
that the density ratio doesn't change). But that has its own problems
when the table contents *do* change. What I'm currently imagining is to
do a smoothed moving average, where we factor in the new density
estimate with a weight dependent on the percentage of the table we did
scan. That is, the calculation goes something like

    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    new_reltuples = updated_density * new_relpages

We could slow the moving-average convergence even further when
reliability is small; for instance if you were really paranoid you
might want to use the square of reliability in line 4. That might be
overdoing it, though.

			regards, tom lane
On Wed, May 25, 2011 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I had been thinking about the latter point. We could be
> conservative and just keep the reported tuple density the same (ie,
> update relpages to the new correct value, while setting reltuples so
> that the density ratio doesn't change). But that has its own problems
> when the table contents *do* change. What I'm currently imagining is
> to do a smoothed moving average, where we factor in the new density
> estimate with a weight dependent on the percentage of the table we did
> scan. That is, the calculation goes something like
>
> old_density = old_reltuples / old_relpages
> new_density = counted_tuples / scanned_pages
> reliability = scanned_pages / new_relpages
> updated_density = old_density + (new_density - old_density) * reliability
> new_reltuples = updated_density * new_relpages
>
> We could slow the moving-average convergence even further when
> reliability is small; for instance if you were really paranoid you might
> want to use the square of reliability in line 4. That might be
> overdoing it, though.

I don't know. That's maybe better, but I'd be willing to wager that in
some cases it will just slow down the rate at which we converge to a
completely incorrect value, while in other cases it'll fail to update
the data when it really has changed.

I am wondering, though, why we're not just inserting a special-purpose
hack for TOAST tables. Your email seems to indicate that regular tables
are already handled well enough, and certainly if we only whack around
the TOAST behavior it's much less likely to fry anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > I don't know what client-side code might be looking at
> > relpages/reltuples.
>
> I know that I find reltuples useful for getting an "accurate enough"
> sense of rows in a table (or set of tables) without resorting to
> count(*). I'd be OK with any two of pages, tuples, and density; but
> dropping to one would make databases harder to manage, IMV.
>
> Personally, I don't have much code that uses those columns;
> eliminating an existing column wouldn't involve much pain for me as
> long as it could still be derived.

Well, we only actually need to store one number, because you can figure
out a much more precise number-of-pages figure with pg_relation_size()
divided by the configured page size.

(We feel free to hack around catalogs in other places, and we regularly
break pgAdmin and the like when we drop columns -- people just live
with it and update their tools. I don't think it's such a big deal in
this particular case.)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 25, 2011 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I had been thinking about the latter point. We could be
>> conservative and just keep the reported tuple density the same (ie,
>> update relpages to the new correct value, while setting reltuples so
>> that the density ratio doesn't change). But that has its own problems
>> when the table contents *do* change. What I'm currently imagining is
>> to do a smoothed moving average, where we factor in the new density
>> estimate with a weight dependent on the percentage of the table we did
>> scan. That is, the calculation goes something like
>>
>> old_density = old_reltuples / old_relpages
>> new_density = counted_tuples / scanned_pages
>> reliability = scanned_pages / new_relpages
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages
>>
>> We could slow the moving-average convergence even further when
>> reliability is small; for instance if you were really paranoid you might
>> want to use the square of reliability in line 4. That might be
>> overdoing it, though.

> I don't know. That's maybe better, but I'd be willing to wager that
> in some cases it will just slow down the rate at which we converge to
> a completely incorrect value, while in other cases it'll fail to
> update the data when it really has changed.

[ shrug... ] When you don't have complete information, it's *always*
the case that you will sometimes make a mistake. That's not
justification for paralysis, especially not when the existing code is
demonstrably broken.

What occurs to me after thinking a bit more is that the existing tuple
density is likely to be only an estimate, too (one coming from the last
ANALYZE, which could very well have scanned even less of the table than
VACUUM did). So what I now think is that both VACUUM and ANALYZE ought
to use a calculation of the above form to arrive at a new value for
pg_class.reltuples. In both cases it would be pretty easy to track the
number of pages we looked at while counting tuples, so the same raw
information is available.

> I am wondering, though, why we're not just inserting a special-purpose
> hack for TOAST tables.

Because the problem is not specific to TOAST tables. As things
currently stand, we will accept the word of an ANALYZE as gospel even
if it scanned 1% of the table, and completely ignore the results from a
VACUUM even if it scanned 99% of the table. This is not sane.

			regards, tom lane
Excerpts from Robert Haas's message of mié may 25 12:54:28 -0400 2011:
> I don't know. That's maybe better, but I'd be willing to wager that
> in some cases it will just slow down the rate at which we converge to
> a completely incorrect value, while in other cases it'll fail to
> update the data when it really has changed.

For regular tables I don't think there's a real problem, because it'll
get fixed the next time a full scan happens anyway. For toast tables, I
think the set of operations is limited enough that this is easy to
prove correct (or fixed so that it is) -- no HOT updates (in fact no
updates at all), etc.

BTW, one thing we haven't fixed at all is how HOT updates affect
vacuuming behavior ...

> I am wondering, though, why we're not just inserting a special-purpose
> hack for TOAST tables. Your email seems to indicate that regular
> tables are already handled well enough, and certainly if we only whack
> around the TOAST behavior it's much less likely to fry anything.

Well, having two paths, one of which is uncommonly used, means that it
will get all the bugs. As with the xl_commit WAL record comment from
Simon, I'd rather stick with having a single one.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, May 25, 2011 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [ shrug... ] When you don't have complete information, it's *always*
> the case that you will sometimes make a mistake. That's not
> justification for paralysis, especially not when the existing code is
> demonstrably broken.
>
> What occurs to me after thinking a bit more is that the existing tuple
> density is likely to be only an estimate, too (one coming from the last
> ANALYZE, which could very well have scanned even less of the table than
> VACUUM did). So what I now think is that both VACUUM and ANALYZE ought
> to use a calculation of the above form to arrive at a new value for
> pg_class.reltuples. In both cases it would be pretty easy to track the
> number of pages we looked at while counting tuples, so the same raw
> information is available.
>
>> I am wondering, though, why we're not just inserting a special-purpose
>> hack for TOAST tables.
>
> Because the problem is not specific to TOAST tables. As things
> currently stand, we will accept the word of an ANALYZE as gospel even if
> it scanned 1% of the table, and completely ignore the results from a
> VACUUM even if it scanned 99% of the table. This is not sane.

I agree that if VACUUM scanned 99% of the table, it's probably fine to
use its numbers. It's also fine to use the numbers from ANALYZE,
because those pages are chosen randomly. What bothers me is the idea of
using a small *non-random* sample, and I'm not sure that incorporating
possibly-bogus results slowly is any better than incorporating them
quickly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 25, 2011 at 1:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that if VACUUM scanned 99% of the table, it's probably fine to
> use its numbers. It's also fine to use the numbers from ANALYZE,
> because those pages are chosen randomly. What bothers me is the idea
> of using a small *non-random* sample, and I'm not sure that
> incorporating possibly-bogus results slowly is any better than
> incorporating them quickly.

In particular, unless I'm misremembering, VACUUM *always* scans the
first few pages of the table, until it sees enough consecutive
all-visible bits that it decides to start skipping. If I'm right about
that, then those pages could easily end up being overweighted when
VACUUM does the counting, especially if ANALYZE or an actual full-table
vacuum aren't allowed to snap the count back to reality.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Cédric Villemain
2011/5/25 Alvaro Herrera <alvherre@commandprompt.com>:
> Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> > I don't know what client-side code might be looking at
>> > relpages/reltuples.
>>
>> I know that I find reltuples useful for getting an "accurate enough"
>> sense of rows in a table (or set of tables) without resorting to
>> count(*). I'd be OK with any two of pages, tuples, and density; but
>> dropping to one would make databases harder to manage, IMV.
>>
>> Personally, I don't have much code that uses those columns;
>> eliminating an existing column wouldn't involve much pain for me as
>> long as it could still be derived.
>
> Well, we only actually need to store one number, because you can figure
> out a much more precise number-of-pages figure with pg_relation_size()
> divided by configured page size.
>
> (We feel free to hack around catalogs in other places, and we regularly
> break pgAdmin and the like when we drop columns -- people just live with
> it and update their tools. I don't think it's such a big deal in this
> particular case.)

I may be missing something, but we need the relation size in costsize.c
even if we have a reldensity (or we need reltuples). Otherwise, what
values should be used to estimate the relation size? (pg_relation_size()
goes down to the kernel/fs to ask for stat.st_size; is that really what
we want?)

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:

> > Well, we only actually need to store one number, because you can figure
> > out a much more precise number-of-pages figure with pg_relation_size()
> > divided by configured page size.

> I may miss something but we need relation size in costsize.c even if
> we have a reldensity (or we need a reltuples). Else what values should
> be used to estimate the relation size ? (pg_relation_size() goes down
> to kernel/fs to ask the stat.st_size, is it really what we want?)

Actually yes, the planner does go to the kernel to determine the
current relation size, and then multiplies by the density as computed
from catalog data to figure out a reasonably accurate current number of
tuples.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
From: Cédric Villemain
2011/5/25 Alvaro Herrera <alvherre@commandprompt.com>:
> Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:
>
>> > Well, we only actually need to store one number, because you can figure
>> > out a much more precise number-of-pages figure with pg_relation_size()
>> > divided by configured page size.
>
>> I may miss something but we need relation size in costsize.c even if
>> we have a reldensity (or we need a reltuples). Else what values should
>> be used to estimate the relation size ? (pg_relation_size() goes down
>> to kernel/fs to ask the stat.st_size, is it really what we want?)
>
> Actually yes, the planner does go to kernel to determine the current
> relation size, and then multiplies by density as computed from catalog
> data to figure out current reasonably accurate number of tuples.

Okay! I just read that part. Interesting. (If I'm reading it
correctly, we find our last segment and then seek to the end of that
segment to get the size.) Makes more sense, suddenly :)

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 25, 2011 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Because the problem is not specific to TOAST tables. As things
>> currently stand, we will accept the word of an ANALYZE as gospel even if
>> it scanned 1% of the table, and completely ignore the results from a
>> VACUUM even if it scanned 99% of the table. This is not sane.

> I agree that if VACUUM scanned 99% of the table, it's probably fine to
> use its numbers. It's also fine to use the numbers from ANALYZE,
> because those pages are chosen randomly. What bothers me is the idea
> of using a small *non-random* sample, and I'm not sure that
> incorporating possibly-bogus results slowly is any better than
> incorporating them quickly.

The above is simply fuzzy thinking. The fact that ANALYZE looked at a
random subset of pages is *no guarantee whatsoever* that its results
are highly accurate. They might be more trustworthy than VACUUM's
nonrandom sample of a similar number of pages, but it doesn't hold even
a little bit of water to claim that we should believe ANALYZE
completely and VACUUM not at all even when the latter has looked at a
significantly larger sample of pages.

In any case, your line of thought doesn't help us for fixing the
problem with toast tables, because we aren't going to start doing
ANALYZEs on toast tables.

The bottom line here is that making use of the stats we have is a lot
better than not making use of them, even if they aren't entirely
trustworthy.

			regards, tom lane
On Wed, May 25, 2011 at 5:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, May 25, 2011 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Because the problem is not specific to TOAST tables. As things
>>> currently stand, we will accept the word of an ANALYZE as gospel even if
>>> it scanned 1% of the table, and completely ignore the results from a
>>> VACUUM even if it scanned 99% of the table. This is not sane.
>
>> I agree that if VACUUM scanned 99% of the table, it's probably fine to
>> use its numbers. It's also fine to use the numbers from ANALYZE,
>> because those pages are chosen randomly. What bothers me is the idea
>> of using a small *non-random* sample, and I'm not sure that
>> incorporating possibly-bogus results slowly is any better than
>> incorporating them quickly.
>
> The above is simply fuzzy thinking. The fact that ANALYZE looked at a
> random subset of pages is *no guarantee whatsoever* that its results are
> highly accurate. They might be more trustworthy than VACUUM's nonrandom
> sample of a similar number of pages, but it doesn't hold even a little
> bit of water to claim that we should believe ANALYZE completely and
> VACUUM not at all even when the latter has looked at a significantly
> larger sample of pages.

I think you're arguing against a straw man.

> In any case, your line of thought doesn't help us for fixing the problem
> with toast tables, because we aren't going to start doing ANALYZEs on
> toast tables.

Can we simply use a constant for tuple density on TOAST tables?

> The bottom line here is that making use of stats we have is a lot better
> than not making use of them, even if they aren't entirely trustworthy.

http://dilbert.com/strips/comic/2008-05-07/

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 25, 2011 at 10:05 PM, Greg Stark <gsstark@mit.edu> wrote:
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages
>
> This amounts to assuming that the pages observed in the vacuum have
> the density observed and the pages that weren't seen have the density
> that were previously in the reltuples/relpages stats.

In case it's not clear, Tom's expression for updated_density is
equivalent by simple algebra to:

    updated_density = (old_density * (1 - reliability)) + (new_density * reliability)

--
greg
On Wed, May 25, 2011 at 9:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I had been thinking about the latter point. We could be
> conservative and just keep the reported tuple density the same (ie,
> update relpages to the new correct value, while setting reltuples so
> that the density ratio doesn't change). But that has its own problems
> when the table contents *do* change. What I'm currently imagining is
> to do a smoothed moving average, where we factor in the new density
> estimate with a weight dependent on the percentage of the table we did
> scan. That is, the calculation goes something like
>
> old_density = old_reltuples / old_relpages
> new_density = counted_tuples / scanned_pages
> reliability = scanned_pages / new_relpages
> updated_density = old_density + (new_density - old_density) * reliability
> new_reltuples = updated_density * new_relpages

This amounts to assuming that the pages observed in the vacuum have the
density observed, and the pages that weren't seen have the density that
was previously in the reltuples/relpages stats. That seems like a
pretty solid approach to me. If the numbers were sane before, it
follows that they should be sane after the update.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> On Wed, May 25, 2011 at 9:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... What I'm currently imagining is
>> to do a smoothed moving average, where we factor in the new density
>> estimate with a weight dependent on the percentage of the table we did
>> scan. That is, the calculation goes something like
>>
>> old_density = old_reltuples / old_relpages
>> new_density = counted_tuples / scanned_pages
>> reliability = scanned_pages / new_relpages
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages

> This amounts to assuming that the pages observed in the vacuum have
> the density observed and the pages that weren't seen have the density
> that were previously in the reltuples/relpages stats. That seems like
> a pretty solid approach to me. If the numbers were sane before it
> follows that they should be sane after the update.

Hm, that's an interesting way of looking at it, but I was coming at it
from a signal-processing point of view. What Robert is concerned about
is that if VACUUM is cleaning a non-representative sample of pages, and
repeated VACUUMs examine pretty much the same sample each time, then
over repeated applications of the above formula the estimated density
will eventually converge to what we are seeing in the sample. The
speed of convergence depends on the moving-average multiplier, ie the
"reliability" number above, and what I was after was just to slow down
convergence for smaller samples. So I wouldn't have any problem with
including a fudge factor to make the convergence even slower. But your
analogy makes it seem like this particular formulation is actually
"right" in some sense.

One other point here is that Florian's problem is really only with our
failing to update relpages. I don't think there is any part of the
system that particularly cares about reltuples for a toast table. So
even if the value did converge to some significantly-bad estimate over
time, it's not really an issue AFAICS. We do care about having a sane
reltuples estimate for regular tables, but for those we should have a
mixture of updates from ANALYZE and updates from VACUUM. Also, for
both regular and toast tables we will have an occasional
vacuum-for-wraparound that is guaranteed to scan all pages and hence do
a hard reset of reltuples to the correct value.

I'm still of the opinion that an incremental estimation process like
the above is a lot saner than what we're doing now, snarky Dilbert
references notwithstanding. The only thing that seems worthy of debate
from here is whether we should trust ANALYZE's estimates a bit more
than VACUUM's estimates, on the grounds that the former are more likely
to be from a random subset of pages. We could implement that by
applying a fudge factor when folding a VACUUM estimate into the moving
average (ie, multiply its reliability by something less than one). I
don't have any principled suggestion for just what the fudge factor
ought to be, except that I don't think "zero" is the best value, which
AFAICT is what Robert is arguing. I think Greg's argument shows that
"one" is the right value when dealing with an ANALYZE estimate, if you
believe that ANALYZE saw a random set of pages ... but using that for
VACUUM does seem overoptimistic.

			regards, tom lane
On Thu, May 26, 2011 at 11:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm still of the opinion that an incremental estimation process like > the above is a lot saner than what we're doing now, snarky Dilbert > references notwithstanding. The only thing that seems worthy of debate > from here is whether we should trust ANALYZE's estimates a bit more than > VACUUM's estimates, on the grounds that the former are more likely to be > from a random subset of pages. We could implement that by applying a > fudge factor when folding a VACUUM estimate into the moving average (ie, > multiply its reliability by something less than one). I don't have any > principled suggestion for just what the fudge factor ought to be, except > that I don't think "zero" is the best value, which AFAICT is what Robert > is arguing. I think Greg's argument shows that "one" is the right value > when dealing with an ANALYZE estimate, if you believe that ANALYZE saw a > random set of pages ... but using that for VACUUM does seem > overoptimistic.

The problem is that it's quite difficult to predict what the relative frequencies of full-relation-vacuum, vacuum-with-skips, and ANALYZE operations on the table will be. It matters how fast the table is being inserted into vs. updated/deleted; and it also matters how fast the table is being updated compared with the system's rate of XID consumption. So in general it seems hard to say, well, we know this number might drift off course a little bit, but there will be a freezing vacuum or analyze or something coming along soon enough to fix the problem. There might be, but it's difficult to be sure. My argument isn't so much that using a non-zero value here is guaranteed to have bad effects, but that we really have no idea what will work out well in practice, and therefore it seems dangerous to whack the behavior around ... especially in stable branches. If we changed this in 9.1, and that's the last time we ever get a complaint about it, problem solved. 
But I would feel bad if we changed this in the back-branches and then found that, while solving this particular problem, we had created others. It also seems likely that the replacement problems would be more subtle and more difficult to diagnose, because they'd depend in a very complicated way on the workload, and having, say, the latest table contents would not necessarily enable us to reproduce the problem. I would feel a lot better about something that is deterministic, like, I dunno, if VACUUM visits more than 25% of the table, we use its estimate. And we always use ANALYZE's estimate. Or something. Another thought: Couldn't relation_needs_vacanalyze() just scale up reltuples by the ratio of the current number of pages in the relation to relpages, just as the query planner does? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I would feel a lot better about something that is deterministic, like, > I dunno, if VACUUM visits more than 25% of the table, we use its > estimate. And we always use ANALYZE's estimate. Or something. This argument seems to rather miss the point. The data we are working from is fundamentally not deterministic, and you can't make it so by deciding to ignore what data we do have. That leads to a less useful estimate, not a more useful estimate. > Another thought: Couldn't relation_needs_vacanalyze() just scale up > reltuples by the ratio of the current number of pages in the relation > to relpages, just as the query planner does? Hmm ... that would fix Florian's immediate issue, and it does seem like a good change on its own merits. But it does nothing for the problem that we're failing to put the best available information into pg_class. Possibly we could compromise on doing just that much in the back branches, and the larger change for 9.1? regards, tom lane
On Thu, May 26, 2011 at 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Another thought: Couldn't relation_needs_vacanalyze() just scale up >> reltuples by the ratio of the current number of pages in the relation >> to relpages, just as the query planner does? > > Hmm ... that would fix Florian's immediate issue, and it does seem like > a good change on its own merits. But it does nothing for the problem > that we're failing to put the best available information into pg_class. > > Possibly we could compromise on doing just that much in the back > branches, and the larger change for 9.1? Do you think we need to worry about the extra overhead of determining the current size of every relation as we sweep through pg_class? It's not a lot, but OTOH I think we'd be doing it once a minute... not sure what would happen if there were tons of tables. Going back to your thought upthread, I think we should really consider replacing reltuples with reltupledensity at some point. I continue to be afraid that using a decaying average in this case is going to end up overweighting the values from some portion of the table that's getting scanned repeatedly, at the expense of other portions of the table that are not getting scanned at all. Now, perhaps experience will prove that's not a problem. But storing relpages and reltupledensity separately would give us more flexibility, because we could feel free to bump relpages even when we're not sure what to do about reltupledensity. That would help Florian's problem quite a lot, even if we did nothing else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
From: "Kevin Grittner"
Robert Haas <robertmhaas@gmail.com> wrote: > I think we should really consider replacing reltuples with > reltupledensity at some point. I continue to be afraid that using > a decaying average in this case is going to end up overweighting > the values from some portion of the table that's getting scanned > repeatedly, at the expense of other portions of the table that are > not getting scanned at all. Now, perhaps experience will prove > that's not a problem. But storing relpages and reltupledensity > separately would give us more flexibility, because we could feel > free to bump relpages even when we're not sure what to do about > reltupledensity. That would help Florian's problem quite a lot, > even if we did nothing else. Given how trivial it would be to adjust reltuples to keep its ratio to relpages about the same when we don't have a new "hard" number, but some evidence that we should fudge our previous value, I don't see where this change buys us much. It seems to mostly obfuscate the fact that we're changing our assumption about how many tuples we have. I would rather that we did that explicitly with code comments about why it's justified than to slip it in the way you suggest. -Kevin
On Thu, May 26, 2011 at 1:28 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: >> I think we should really consider replacing reltuples with >> reltupledensity at some point. I continue to be afraid that using >> a decaying average in this case is going to end up overweighting >> the values from some portion of the table that's getting scanned >> repeatedly, at the expense of other portions of the table that are >> not getting scanned at all. Now, perhaps experience will prove >> that's not a problem. But storing relpages and reltupledensity >> separately would give us more flexibility, because we could feel >> free to bump relpages even when we're not sure what to do about >> reltupledensity. That would help Florian's problem quite a lot, >> even if we did nothing else. > > Given how trivial it would be to adjust reltuples to keep its ratio > to relpages about the same when we don't have a new "hard" number, > but some evidence that we should fudge our previous value, I don't > see where this change buys us much. It seems to mostly obfuscate > the fact that we're changing our assumption about how many tuples we > have. I would rather that we did that explicitly with code comments > about why it's justified than to slip it in the way you suggest. I'm a bit confused by this - what the current design obfuscates is the fact that reltuples and relpages are not really independent columns; you can't update one without updating the other, unless you want screwy behavior. Replacing reltuples by reltupledensity would fix that problem - it would be logical and non-damaging to update either column independently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> Given how trivial it would be to adjust reltuples to keep its >> ratio to relpages about the same when we don't have a new "hard" >> number, but some evidence that we should fudge our previous >> value, I don't see where this change buys us much. It seems to >> mostly obfuscate the fact that we're changing our assumption >> about how many tuples we have. I would rather that we did that >> explicitly with code comments about why it's justified than to >> slip it in the way you suggest. > > I'm a bit confused by this - what the current design obfuscates is > the fact that reltuples and relpages are not really independent > columns; you can't update one without updating the other, unless > you want screwy behavior. Replacing reltuples by reltupledensity > would fix that problem - it would be logical and non-damaging to > update either column independently.

They don't always move in tandem. Certainly there can be available space in those pages from which tuples can be allocated or which increases as tuples are vacuumed. Your proposed change would make neither more nor less information available, because we've got two numbers which can be observed as raw counts, and a ratio between them. By storing the ratio and one count you make changes to the other count implied and less visible. It seems more understandable and less prone to error (to me, anyway) to keep the two "raw" numbers and calculate the ratio -- and when you observe a change in one raw number which you believe should force an adjustment to the other raw number before its next actual value is observed, to comment on why that's a good idea, and do the trivial arithmetic at that time.

As a thought exercise, what happens each way if a table is loaded with a low fillfactor and then a lot of inserts are done? What happens if mass deletes are done from a table which has a high density? -Kevin
On Thu, May 26, 2011 at 2:05 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> I'm a bit confused by this - what the current design obfuscates is >> the fact that reltuples and relpages are not really independent >> columns; you can't update one without updating the other, unless >> you want screwy behavior. Replacing reltuples by reltupledensity >> would fix that problem - it would be logical and non-damaging to >> update either column independently. > > They don't always move in tandem. Certainly there can be available > space in those pages from which tuples can be allocated or which > increases as tuples are vacuumed. Your proposed change would > neither make more or less information available, because we've got > two numbers which can be observed as raw counts, and a ratio between > them. So far I agree. > By storing the ratio and one count you make changes to the > other count implied and less visible. It seems more understandable > and less prone to error (to me, anyway) to keep the two "raw" > numbers and calculate the ratio -- and when you observe a change in > one raw number which you believe should force an adjustment to the > other raw number before its next actual value is observed, to > comment on why that's a good idea, and do the trivial arithmetic at > that time. Except that's not how it works. At least in the case of ANALYZE, we *aren't* counting all the tuples in the table. We're selecting a random sample of pages and inferring a tuple density, which we then extrapolate to the whole table and store. Then when we pull it back out of the table, we convert it back to a tuple density. The real value we are computing and using almost everywhere is tuple density; storing a total number of tuples in the table appears to be just confusing the issue. Unless, of course, I am misunderstanding, which is possible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Except that's not how it works. At least in the case of ANALYZE, we > *aren't* counting all the tuples in the table. We're selecting a > random sample of pages and inferring a tuple density, which we then > extrapolate to the whole table and store. Then when we pull it back > out of the table, we convert it back to a tuple density. The real > value we are computing and using almost everywhere is tuple density; > storing a total number of tuples in the table appears to be just > confusing the issue. If we were starting in a green field we might choose to store tuple density. However, the argument for changing it now is at best mighty thin; IMO it is not worth the risk of breaking client code. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, May 26, 2011 at 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Another thought: Couldn't relation_needs_vacanalyze() just scale up >>> reltuples by the ratio of the current number of pages in the relation >>> to relpages, just as the query planner does? >> Hmm ... that would fix Florian's immediate issue, and it does seem like >> a good change on its own merits. But it does nothing for the problem >> that we're failing to put the best available information into pg_class. >> >> Possibly we could compromise on doing just that much in the back >> branches, and the larger change for 9.1? > Do you think we need to worry about the extra overhead of determining > the current size of every relation as we sweep through pg_class? It's > not a lot, but OTOH I think we'd be doing it once a minute... not sure > what would happen if there were tons of tables.

Ugh ... that is a mighty good point, since the RelationGetNumberOfBlocks call would have to happen for each table, even the ones we then decide not to vacuum. We've already seen people complain about the cost of the AV launcher once they have a lot of databases, and this would probably increase it quite a bit.

> Going back to your thought upthread, I think we should really consider > replacing reltuples with reltupledensity at some point. I continue to > be afraid that using a decaying average in this case is going to end > up overweighting the values from some portion of the table that's > getting scanned repeatedly, at the expense of other portions of the > table that are not getting scanned at all.

Changing the representation of the information would change that issue not in the slightest. The fundamental point here is that we have new, possibly partial, information which we ought to somehow merge with the old, also possibly partial, information. Storing the data a little bit differently doesn't magically eliminate that issue.

regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote: > Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> By storing the ratio and one count you make changes to the >> other count implied and less visible. It seems more >> understandable and less prone to error (to me, anyway) to keep >> the two "raw" numbers and calculate the ratio -- and when you >> observe a change in one raw number which you believe should force >> an adjustment to the other raw number before its next actual >> value is observed, to comment on why that's a good idea, and do >> the trivial arithmetic at that time. > > Except that's not how it works. At least in the case of ANALYZE, > we *aren't* counting all the tuples in the table. We're selecting > a random sample of pages and inferring a tuple density, which we > then extrapolate to the whole table and store. Then when we pull > it back out of the table, we convert it back to a tuple density. > The real value we are computing and using almost everywhere is > tuple density; storing a total number of tuples in the table > appears to be just confusing the issue.

Well, if tuple density is the number which is most heavily used, it might shave a few nanoseconds doing the arithmetic in enough places to justify the change, but I'm skeptical. Basically I'm with Tom on the fact that this change would store neither more nor less information (and for that matter would not really change what information you can easily retrieve); and slightly changing the manner in which it is stored doesn't solve any of the problems you assert that it does.

When we prune or vacuum a page, I don't suppose we have enough information about that page's previous state to calculate a tuple count delta, do we? That would allow a far more accurate number to be maintained than anything suggested so far, as long as we tweak autovacuum to count inserts toward the need to vacuum. (It seems to me I saw a post giving some reasons why that would have benefits anyway.) 
Except for the full pass during transaction wrap-around protection, where it could just set a new actual count, autovacuum would be skipping pages where the bit is set to indicate that all tuples are visible, right? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > When we prune or vacuum a page, I don't suppose we have enough > information about that page's previous state to calculate a tuple > count delta, do we? That would allow a far more accurate number to > be maintained than anything suggested so far, as long as we tweak > autovacuum to count inserts toward the need to vacuum. Well, that was the other direction that was suggested upthread: stop relying on reltuples at all, but use the stats collector's counts. That might be a good solution in the long run, but there are some issues: 1. It's not clear how using a current count, as opposed to time-of-last-vacuum count, would affect the behavior of the autovacuum control logic. At first glance I think it would break it, since the basic logic there is "how much of the table changed since it was last vacuumed?". Even if the equations could be modified to still work, I remember enough feedback control theory from undergrad EE to think that this is something to be seriously scared of tweaking without extensive testing. IMO it is far more risky than what Robert is worried about. 2. You still have the problem that we're exposing inaccurate (or at least less accurate than they could be) counts to the planner and to onlooker clients. We could change the planner to also depend on the stats collector instead of reltuples, but at that point you just removed the option for people to turn off the stats collector. The implications for plan stability might be unpleasant, too. So that's not a direction I want to go without a significant amount of work and testing. regards, tom lane
On Thu, May 26, 2011 at 5:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> When we prune or vacuum a page, I don't suppose we have enough >> information about that page's previous state to calculate a tuple >> count delta, do we? That would allow a far more accurate number to >> be maintained than anything suggested so far, as long as we tweak >> autovacuum to count inserts toward the need to vacuum. > > Well, that was the other direction that was suggested upthread: stop > relying on reltuples at all, but use the stats collector's counts. > That might be a good solution in the long run, but there are some > issues: > > 1. It's not clear how using a current count, as opposed to > time-of-last-vacuum count, would affect the behavior of the autovacuum > control logic. At first glance I think it would break it, since the > basic logic there is "how much of the table changed since it was last > vacuumed?". Even if the equations could be modified to still work, > I remember enough feedback control theory from undergrad EE to think that > this is something to be seriously scared of tweaking without extensive > testing. IMO it is far more risky than what Robert is worried about. Yeah, I think that would be broken. > 2. You still have the problem that we're exposing inaccurate (or at > least less accurate than they could be) counts to the planner and to > onlooker clients. We could change the planner to also depend on the > stats collector instead of reltuples, but at that point you just removed > the option for people to turn off the stats collector. The implications > for plan stability might be unpleasant, too. > > So that's not a direction I want to go without a significant amount > of work and testing. FWIW, I agree. Your proposed solution is certainly better than trying to do this; but it still seems a bit shaky to me. Still, maybe we don't have a better option. 
If it were me, I'd add an additional safety valve: use your formula if the percentage of the relation scanned is above some threshold where there's unlikely to be too much skew. But if the percentage scanned is too small, then don't use that formula. Instead, only update relpages/reltuples if the relation is now larger; set relpages to the new actual value, and scale up reltuples proportionately. However, I just work here. It's possible that I'm worrying about a problem that won't materialize in practice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Still, maybe we don't have a better option. If it were me, I'd add an > additional safety valve: use your formula if the percentage of the > relation scanned is above some threshold where there's unlikely to be > too much skew. But if the percentage scanned is too small, then don't > use that formula. Instead, only update relpages/reltuples if the > relation is now larger; set relpages to the new actual value, and > scale up reltuples proportionately. Ah, progress: now we're down to arguing about the size of the fudge factor ;-). I'll do something involving derating the reliability when the number is coming from VACUUM. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > Still, maybe we don't have a better option. If it were me, I'd add an > additional safety valve: use your formula if the percentage of the > relation scanned is above some threshold where there's unlikely to be > too much skew. But if the percentage scanned is too small, then don't > use that formula. Instead, only update relpages/reltuples if the > relation is now larger; set relpages to the new actual value, and > scale up reltuples proportionately. > However, I just work here. It's possible that I'm worrying about a > problem that won't materialize in practice. Attached is a proposed patch to fix these issues. Experimentation convinced me that including a fudge factor for VACUUM's results made things *less* accurate, not more so. The reason seems to be bound up in Greg Stark's observation that the unmodified calculation is equivalent to assuming that the old average tuple density still applies to the unscanned pages. In a VACUUM, we know that the unscanned pages are exactly those that have had no changes since (at least) the last vacuum, which means that indeed the old density ought to be a good estimate. Now, this reasoning can break down if the table's tuple density is nonuniform, but what I found in my testing is that if you vacuum after a significant change in a table (such as deleting a lot of rows), and you don't apply the full unfudged correction, you get a badly wrong result. I think that's a more significant issue than the possibility of drift over time. I also found that Greg was right in thinking that it would help if we tweaked lazy_scan_heap to not always scan the first SKIP_PAGES_THRESHOLD-1 pages even if they were all_visible_according_to_vm. That seemed to skew the results if those pages weren't representative. And, for the case of a useless manual vacuum on a completely clean table, it would cause the reltuples value to drift when there was no reason to change it at all. 
Lastly, this patch removes a bunch of grotty interconnections between VACUUM and ANALYZE that were meant to prevent ANALYZE from updating the stats if VACUUM had done a full-table scan in the same command. With the new logic it's relatively harmless if ANALYZE does that, and anyway autovacuum frequently fires the two cases independently anyway, making all that logic quite useless in the normal case. (This simplification accounts for the bulk of the diff, actually.) Comments? regards, tom lane diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 0568a1bcf86281a9b1086d343e7027557295065c..fa84989fc6fa8be90d4eecb9c33e94a232d79880 100644 *** a/src/backend/commands/analyze.c --- b/src/backend/commands/analyze.c *************** static MemoryContext anl_context = NULL; *** 84,91 **** static BufferAccessStrategy vac_strategy; ! static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, ! bool update_reltuples, bool inh); static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize); static bool BlockSampler_HasMore(BlockSampler bs); --- 84,90 ---- static BufferAccessStrategy vac_strategy; ! static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh); static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize); static bool BlockSampler_HasMore(BlockSampler bs); *************** static bool std_typanalyze(VacAttrStats *** 115,132 **** /* * analyze_rel() -- analyze one relation - * - * If update_reltuples is true, we update reltuples and relpages columns - * in pg_class. Caller should pass false if we're part of VACUUM ANALYZE, - * and the VACUUM didn't skip any pages. We only have an approximate count, - * so we don't want to overwrite the accurate values already inserted by the - * VACUUM in that case. VACUUM always scans all indexes, however, so the - * pg_class entries for indexes are never updated if we're part of VACUUM - * ANALYZE. */ void ! 
analyze_rel(Oid relid, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy, bool update_reltuples) { Relation onerel; --- 114,122 ---- /* * analyze_rel() -- analyze one relation */ void ! analyze_rel(Oid relid, VacuumStmt *vacstmt, BufferAccessStrategy bstrategy) { Relation onerel; *************** analyze_rel(Oid relid, VacuumStmt *vacst *** 238,250 **** /* * Do the normal non-recursive ANALYZE. */ ! do_analyze_rel(onerel, vacstmt, update_reltuples, false); /* * If there are child tables, do recursive ANALYZE. */ if (onerel->rd_rel->relhassubclass) ! do_analyze_rel(onerel, vacstmt, false, true); /* * Close source relation now, but keep lock so that no one deletes it --- 228,240 ---- /* * Do the normal non-recursive ANALYZE. */ ! do_analyze_rel(onerel, vacstmt, false); /* * If there are child tables, do recursive ANALYZE. */ if (onerel->rd_rel->relhassubclass) ! do_analyze_rel(onerel, vacstmt, true); /* * Close source relation now, but keep lock so that no one deletes it *************** analyze_rel(Oid relid, VacuumStmt *vacst *** 267,274 **** * do_analyze_rel() -- analyze one relation, recursively or not */ static void ! do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, ! bool update_reltuples, bool inh) { int attr_cnt, tcnt, --- 257,263 ---- * do_analyze_rel() -- analyze one relation, recursively or not */ static void ! do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh) { int attr_cnt, tcnt, *************** do_analyze_rel(Relation onerel, VacuumSt *** 437,445 **** } /* ! * Quit if no analyzable columns and no pg_class update needed. */ ! if (attr_cnt <= 0 && !analyzableindex && !update_reltuples) goto cleanup; /* --- 426,434 ---- } /* ! * Quit if no analyzable columns. */ ! if (attr_cnt <= 0 && !analyzableindex) goto cleanup; /* *************** do_analyze_rel(Relation onerel, VacuumSt *** 549,558 **** } /* ! * Update pages/tuples stats in pg_class, but not if we're inside a VACUUM ! * that got a more precise number. */ ! 
if (update_reltuples) vac_update_relstats(onerel, RelationGetNumberOfBlocks(onerel), totalrows, hasindex, InvalidTransactionId); --- 538,547 ---- } /* ! * Update pages/tuples stats in pg_class ... but not if we're doing ! * inherited stats. */ ! if (!inh) vac_update_relstats(onerel, RelationGetNumberOfBlocks(onerel), totalrows, hasindex, InvalidTransactionId); *************** do_analyze_rel(Relation onerel, VacuumSt *** 562,568 **** * VACUUM ANALYZE, don't overwrite the accurate count already inserted by * VACUUM. */ ! if (!(vacstmt->options & VACOPT_VACUUM)) { for (ind = 0; ind < nindexes; ind++) { --- 551,557 ---- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by * VACUUM. */ ! if (!inh && !(vacstmt->options & VACOPT_VACUUM)) { for (ind = 0; ind < nindexes; ind++) { *************** do_analyze_rel(Relation onerel, VacuumSt *** 577,589 **** } /* ! * Report ANALYZE to the stats collector, too; likewise, tell it to adopt ! * these numbers only if we're not inside a VACUUM that got a better ! * number. However, a call with inh = true shouldn't reset the stats. */ if (!inh) ! pgstat_report_analyze(onerel, update_reltuples, ! totalrows, totaldeadrows); /* We skip to here if there were no analyzable columns */ cleanup: --- 566,577 ---- } /* ! * Report ANALYZE to the stats collector, too. However, if doing ! * inherited stats we shouldn't report, because the stats collector only ! * tracks per-table stats. */ if (!inh) ! pgstat_report_analyze(onerel, totalrows, totaldeadrows); /* We skip to here if there were no analyzable columns */ cleanup: *************** acquire_sample_rows(Relation onerel, Hea *** 1243,1260 **** qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows); /* ! * Estimate total numbers of rows in relation. */ if (bs.m > 0) ! { ! *totalrows = floor((liverows * totalblocks) / bs.m + 0.5); ! *totaldeadrows = floor((deadrows * totalblocks) / bs.m + 0.5); ! 
} else - { - *totalrows = 0.0; *totaldeadrows = 0.0; - } /* * Emit some interesting relation info --- 1231,1249 ---- qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows); /* ! * Estimate total numbers of rows in relation. For live rows, use ! * vac_estimate_reltuples; for dead rows, we have no source of old ! * information, so we have to assume the density is the same in unseen ! * pages as in the pages we scanned. */ + *totalrows = vac_estimate_reltuples(onerel, true, + totalblocks, + bs.m, + liverows); if (bs.m > 0) ! *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5); else *totaldeadrows = 0.0; /* * Emit some interesting relation info diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 9606569617afafe16c55752183cb2a6de89bcad1..efae31acab7de0f6d4cbe45fb175a301486ce582 100644 *** a/src/backend/commands/vacuum.c --- b/src/backend/commands/vacuum.c *************** *** 20,25 **** --- 20,27 ---- */ #include "postgres.h" + #include <math.h> + #include "access/clog.h" #include "access/genam.h" #include "access/heapam.h" *************** static BufferAccessStrategy vac_strategy *** 62,68 **** static List *get_rel_oids(Oid relid, const RangeVar *vacrel); static void vac_truncate_clog(TransactionId frozenXID); static bool vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, ! bool for_wraparound, bool *scanned_all); /* --- 64,70 ---- static List *get_rel_oids(Oid relid, const RangeVar *vacrel); static void vac_truncate_clog(TransactionId frozenXID); static bool vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, ! bool for_wraparound); /* *************** vacuum(VacuumStmt *vacstmt, Oid relid, b *** 219,230 **** foreach(cur, relations) { Oid relid = lfirst_oid(cur); - bool scanned_all = false; if (vacstmt->options & VACOPT_VACUUM) { ! if (!vacuum_rel(relid, vacstmt, do_toast, for_wraparound, ! 
&scanned_all))
  				continue;
  		}
--- 221,230 ----
  	foreach(cur, relations)
  	{
  		Oid			relid = lfirst_oid(cur);
  
  		if (vacstmt->options & VACOPT_VACUUM)
  		{
! 			if (!vacuum_rel(relid, vacstmt, do_toast, for_wraparound))
  				continue;
  		}
*************** vacuum(VacuumStmt *vacstmt, Oid relid, b
*** 241,247 ****
  			PushActiveSnapshot(GetTransactionSnapshot());
  		}
  
! 		analyze_rel(relid, vacstmt, vac_strategy, !scanned_all);
  
  		if (use_own_xacts)
  		{
--- 241,247 ----
  			PushActiveSnapshot(GetTransactionSnapshot());
  		}
  
! 		analyze_rel(relid, vacstmt, vac_strategy);
  
  		if (use_own_xacts)
  		{
*************** vacuum_set_xid_limits(int freeze_min_age
*** 454,459 ****
--- 454,537 ----
  
  /*
+  * vac_estimate_reltuples() -- estimate the new value for pg_class.reltuples
+  *
+  * If we scanned the whole relation then we should just use the count of
+  * live tuples seen; but if we did not, we should not trust the count
+  * unreservedly, especially not in VACUUM, which may have scanned a quite
+  * nonrandom subset of the table.  When we have only partial information,
+  * we take the old value of pg_class.reltuples as a measurement of the
+  * tuple density in the unscanned pages.
+  *
+  * This routine is shared by VACUUM and ANALYZE.
+  */
+ double
+ vac_estimate_reltuples(Relation relation, bool is_analyze,
+ 					   BlockNumber total_pages,
+ 					   BlockNumber scanned_pages,
+ 					   double scanned_tuples)
+ {
+ 	BlockNumber old_rel_pages = relation->rd_rel->relpages;
+ 	double		old_rel_tuples = relation->rd_rel->reltuples;
+ 	double		old_density;
+ 	double		new_density;
+ 	double		multiplier;
+ 	double		updated_density;
+ 
+ 	/* If we did scan the whole table, just use the count as-is */
+ 	if (scanned_pages >= total_pages)
+ 		return scanned_tuples;
+ 
+ 	/*
+ 	 * If scanned_pages is zero but total_pages isn't, keep the existing
+ 	 * value of reltuples.
+ 	 */
+ 	if (scanned_pages == 0)
+ 		return old_rel_tuples;
+ 
+ 	/*
+ 	 * If old value of relpages is zero, old density is indeterminate; we
+ 	 * can't do much except scale up scanned_tuples to match total_pages.
+ 	 */
+ 	if (old_rel_pages == 0)
+ 		return floor((scanned_tuples / scanned_pages) * total_pages + 0.5);
+ 
+ 	/*
+ 	 * Okay, we've covered the corner cases.  The normal calculation is to
+ 	 * convert the old measurement to a density (tuples per page), then
+ 	 * update the density using an exponential-moving-average approach,
+ 	 * and finally compute reltuples as updated_density * total_pages.
+ 	 *
+ 	 * For ANALYZE, the moving average multiplier is just the fraction of
+ 	 * the table's pages we scanned.  This is equivalent to assuming
+ 	 * that the tuple density in the unscanned pages didn't change.  Of
+ 	 * course, it probably did, if the new density measurement is different.
+ 	 * But over repeated cycles, the value of reltuples will converge towards
+ 	 * the correct value, if repeated measurements show the same new density.
+ 	 *
+ 	 * For VACUUM, the situation is a bit different: we have looked at a
+ 	 * nonrandom sample of pages, but we know for certain that the pages we
+ 	 * didn't look at are precisely the ones that haven't changed lately.
+ 	 * Thus, there is a reasonable argument for doing exactly the same thing
+ 	 * as for the ANALYZE case, that is use the old density measurement as
+ 	 * the value for the unscanned pages.
+ 	 *
+ 	 * This logic could probably use further refinement.
+ 	 */
+ 	old_density = old_rel_tuples / old_rel_pages;
+ 	new_density = scanned_tuples / scanned_pages;
+ 	multiplier = (double) scanned_pages / (double) total_pages;
+ #ifdef NOT_USED
+ 	/* this makes things less accurate, not more so :-( */
+ 	if (!is_analyze && multiplier < 0.5)
+ 		multiplier *= 0.25;		/* apply fudge factor */
+ #endif
+ 	updated_density = old_density + (new_density - old_density) * multiplier;
+ 	return floor(updated_density * total_pages + 0.5);
+ }
+ 
+ 
+ /*
  * vac_update_relstats() -- update statistics for one relation
  *
  * Update the whole-relation statistics that are kept in its pg_class
*************** vacuum_set_xid_limits(int freeze_min_age
*** 480,486 ****
  * somebody vacuuming pg_class might think they could delete a tuple
  * marked with xmin = our xid.
  *
! * This routine is shared by VACUUM and stand-alone ANALYZE.
  */
 void
 vac_update_relstats(Relation relation,
--- 558,564 ----
  * somebody vacuuming pg_class might think they could delete a tuple
  * marked with xmin = our xid.
  *
! * This routine is shared by VACUUM and ANALYZE.
  */
 void
 vac_update_relstats(Relation relation,
*************** vac_truncate_clog(TransactionId frozenXI
*** 758,771 ****
  * many small transactions.  Otherwise, two-phase locking would require
  * us to lock the entire database during one pass of the vacuum cleaner.
  *
- * We'll return true in *scanned_all if the vacuum scanned all heap
- * pages, and updated pg_class.
- *
  * At entry and exit, we are not inside a transaction.
  */
 static bool
! 
vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, bool for_wraparound) { LOCKMODE lmode; Relation onerel; *************** vacuum_rel(Oid relid, VacuumStmt *vacstm *** 775,783 **** int save_sec_context; int save_nestlevel; - if (scanned_all) - *scanned_all = false; - /* Begin a transaction for vacuuming this relation */ StartTransactionCommand(); --- 849,854 ---- *************** vacuum_rel(Oid relid, VacuumStmt *vacstm *** 971,977 **** vacstmt->freeze_min_age, vacstmt->freeze_table_age); } else ! lazy_vacuum_rel(onerel, vacstmt, vac_strategy, scanned_all); /* Roll back any GUC changes executed by index functions */ AtEOXact_GUC(false, save_nestlevel); --- 1042,1048 ---- vacstmt->freeze_min_age, vacstmt->freeze_table_age); } else ! lazy_vacuum_rel(onerel, vacstmt, vac_strategy); /* Roll back any GUC changes executed by index functions */ AtEOXact_GUC(false, save_nestlevel); *************** vacuum_rel(Oid relid, VacuumStmt *vacstm *** 997,1003 **** * totally unimportant for toast relations. */ if (toast_relid != InvalidOid) ! vacuum_rel(toast_relid, vacstmt, false, for_wraparound, NULL); /* * Now release the session-level lock on the master table. --- 1068,1074 ---- * totally unimportant for toast relations. */ if (toast_relid != InvalidOid) ! vacuum_rel(toast_relid, vacstmt, false, for_wraparound); /* * Now release the session-level lock on the master table. diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index 9393fa0727aaad7508e1163623322b4066412257..570f1032f3d4eba46ed54de3f6cd39d62e6d09f0 100644 *** a/src/backend/commands/vacuumlazy.c --- b/src/backend/commands/vacuumlazy.c *************** *** 77,93 **** * Before we consider skipping a page that's marked as clean in * visibility map, we must've seen at least this many clean pages. */ ! 
#define SKIP_PAGES_THRESHOLD 32 typedef struct LVRelStats { /* hasindex = true means two-pass strategy; false means one-pass */ bool hasindex; - bool scanned_all; /* have we scanned all pages (this far)? */ /* Overall statistics about rel */ ! BlockNumber rel_pages; double old_rel_tuples; /* previous value of pg_class.reltuples */ ! double rel_tuples; /* counts only tuples on scanned pages */ BlockNumber pages_removed; double tuples_deleted; BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */ --- 77,94 ---- * Before we consider skipping a page that's marked as clean in * visibility map, we must've seen at least this many clean pages. */ ! #define SKIP_PAGES_THRESHOLD ((BlockNumber) 32) typedef struct LVRelStats { /* hasindex = true means two-pass strategy; false means one-pass */ bool hasindex; /* Overall statistics about rel */ ! BlockNumber rel_pages; /* total number of pages */ ! BlockNumber scanned_pages; /* number of pages we examined */ ! double scanned_tuples; /* counts only tuples on scanned pages */ double old_rel_tuples; /* previous value of pg_class.reltuples */ ! double new_rel_tuples; /* new estimated total # of tuples */ BlockNumber pages_removed; double tuples_deleted; BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */ *************** static int vac_cmp_itemptr(const void *l *** 143,149 **** */ void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy, bool *scanned_all) { LVRelStats *vacrelstats; Relation *Irel; --- 144,150 ---- */ void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, ! 
BufferAccessStrategy bstrategy) { LVRelStats *vacrelstats; Relation *Irel; *************** lazy_vacuum_rel(Relation onerel, VacuumS *** 175,181 **** vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats)); - vacrelstats->scanned_all = true; /* will be cleared if we skip a page */ vacrelstats->old_rel_tuples = onerel->rd_rel->reltuples; vacrelstats->num_index_scans = 0; --- 176,181 ---- *************** lazy_vacuum_rel(Relation onerel, VacuumS *** 205,228 **** FreeSpaceMapVacuum(onerel); /* ! * Update statistics in pg_class. But only if we didn't skip any pages; ! * the tuple count only includes tuples from the pages we've visited, and ! * we haven't frozen tuples in unvisited pages either. The page count is ! * accurate in any case, but because we use the reltuples / relpages ratio ! * in the planner, it's better to not update relpages either if we can't ! * update reltuples. */ ! if (vacrelstats->scanned_all) ! vac_update_relstats(onerel, ! vacrelstats->rel_pages, vacrelstats->rel_tuples, ! vacrelstats->hasindex, ! FreezeLimit); /* report results to the stats collector, too */ pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, ! vacrelstats->scanned_all, ! vacrelstats->rel_tuples); /* and log the action if appropriate */ if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0) --- 205,224 ---- FreeSpaceMapVacuum(onerel); /* ! * Update statistics in pg_class. But don't change relfrozenxid if we ! * skipped any pages. */ ! vac_update_relstats(onerel, ! vacrelstats->rel_pages, vacrelstats->new_rel_tuples, ! vacrelstats->hasindex, ! (vacrelstats->scanned_pages < vacrelstats->rel_pages) ? ! InvalidTransactionId : ! FreezeLimit); /* report results to the stats collector, too */ pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, ! 
vacrelstats->new_rel_tuples); /* and log the action if appropriate */ if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0) *************** lazy_vacuum_rel(Relation onerel, VacuumS *** 239,251 **** get_namespace_name(RelationGetNamespace(onerel)), RelationGetRelationName(onerel), vacrelstats->num_index_scans, ! vacrelstats->pages_removed, vacrelstats->rel_pages, ! vacrelstats->tuples_deleted, vacrelstats->rel_tuples, pg_rusage_show(&ru0)))); } - - if (scanned_all) - *scanned_all = vacrelstats->scanned_all; } /* --- 235,246 ---- get_namespace_name(RelationGetNamespace(onerel)), RelationGetRelationName(onerel), vacrelstats->num_index_scans, ! vacrelstats->pages_removed, ! vacrelstats->rel_pages, ! vacrelstats->tuples_deleted, ! vacrelstats->new_rel_tuples, pg_rusage_show(&ru0)))); } } /* *************** lazy_scan_heap(Relation onerel, LVRelSta *** 301,307 **** HeapTupleData tuple; char *relname; BlockNumber empty_pages, - scanned_pages, vacuumed_pages; double num_tuples, tups_vacuumed, --- 296,301 ---- *************** lazy_scan_heap(Relation onerel, LVRelSta *** 321,327 **** get_namespace_name(RelationGetNamespace(onerel)), relname))); ! empty_pages = vacuumed_pages = scanned_pages = 0; num_tuples = tups_vacuumed = nkeep = nunused = 0; indstats = (IndexBulkDeleteResult **) --- 315,321 ---- get_namespace_name(RelationGetNamespace(onerel)), relname))); ! 
empty_pages = vacuumed_pages = 0; num_tuples = tups_vacuumed = nkeep = nunused = 0; indstats = (IndexBulkDeleteResult **) *************** lazy_scan_heap(Relation onerel, LVRelSta *** 329,340 **** --- 323,363 ---- nblocks = RelationGetNumberOfBlocks(onerel); vacrelstats->rel_pages = nblocks; + vacrelstats->scanned_pages = 0; vacrelstats->nonempty_pages = 0; vacrelstats->latestRemovedXid = InvalidTransactionId; lazy_space_alloc(vacrelstats, nblocks); + /* + * We want to skip pages that don't require vacuuming according to the + * visibility map, but only if we've seen a streak of at least + * SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading + * sequentially, the OS should be doing readahead for us and there's no + * gain in skipping a page now and then. You need a longer run of + * consecutive skipped pages before it's worthwhile. Also, skipping even + * a single page means that we can't update relfrozenxid, so we only want + * to do it if there's a good chance to skip a goodly number of pages. + * + * Before entering the main loop, look ahead to see if there are at least + * SKIP_PAGES_THRESHOLD clean pages at the start of the relation; if so, + * we are going to skip at least one page, and we might as well skip the + * initial pages too. This extra logic prevents over-weighting these + * pages when we estimate the new tuple density in vac_estimate_reltuples. + */ all_visible_streak = 0; + if (!scan_all && nblocks >= SKIP_PAGES_THRESHOLD) + { + for (blkno = 0; blkno < SKIP_PAGES_THRESHOLD; blkno++) + { + if (!visibilitymap_test(onerel, blkno, &vmbuffer)) + break; + all_visible_streak++; + } + if (all_visible_streak < SKIP_PAGES_THRESHOLD) + all_visible_streak = 0; + } + for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; *************** lazy_scan_heap(Relation onerel, LVRelSta *** 351,379 **** bool all_visible; bool has_dead_tuples; - /* - * Skip pages that don't require vacuuming according to the visibility - * map. 
But only if we've seen a streak of at least - * SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading - * sequentially, the OS should be doing readahead for us and there's - * no gain in skipping a page now and then. You need a longer run of - * consecutive skipped pages before it's worthwhile. Also, skipping - * even a single page means that we can't update relfrozenxid or - * reltuples, so we only want to do it if there's a good chance to - * skip a goodly number of pages. - */ if (!scan_all) { all_visible_according_to_vm = visibilitymap_test(onerel, blkno, &vmbuffer); if (all_visible_according_to_vm) { all_visible_streak++; if (all_visible_streak >= SKIP_PAGES_THRESHOLD) - { - vacrelstats->scanned_all = false; continue; - } } else all_visible_streak = 0; --- 374,392 ---- bool all_visible; bool has_dead_tuples; if (!scan_all) { + /* + * Skip pages that don't require vacuuming according to the + * visibility map, as per comment above. + */ all_visible_according_to_vm = visibilitymap_test(onerel, blkno, &vmbuffer); if (all_visible_according_to_vm) { all_visible_streak++; if (all_visible_streak >= SKIP_PAGES_THRESHOLD) continue; } else all_visible_streak = 0; *************** lazy_scan_heap(Relation onerel, LVRelSta *** 381,387 **** vacuum_delay_point(); ! scanned_pages++; /* * If we are close to overrunning the available space for dead-tuple --- 394,400 ---- vacuum_delay_point(); ! vacrelstats->scanned_pages++; /* * If we are close to overrunning the available space for dead-tuple *************** lazy_scan_heap(Relation onerel, LVRelSta *** 764,772 **** } /* save stats for use later */ ! vacrelstats->rel_tuples = num_tuples; vacrelstats->tuples_deleted = tups_vacuumed; /* If any tuples need to be deleted, perform final vacuum cycle */ /* XXX put a threshold on min number of tuples here? */ if (vacrelstats->num_dead_tuples > 0) --- 777,791 ---- } /* save stats for use later */ ! 
vacrelstats->scanned_tuples = num_tuples; vacrelstats->tuples_deleted = tups_vacuumed; + /* now we can compute the new value for pg_class.reltuples */ + vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false, + nblocks, + vacrelstats->scanned_pages, + num_tuples); + /* If any tuples need to be deleted, perform final vacuum cycle */ /* XXX put a threshold on min number of tuples here? */ if (vacrelstats->num_dead_tuples > 0) *************** lazy_scan_heap(Relation onerel, LVRelSta *** 805,811 **** ereport(elevel, (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", RelationGetRelationName(onerel), ! tups_vacuumed, num_tuples, scanned_pages, nblocks), errdetail("%.0f dead row versions cannot be removed yet.\n" "There were %.0f unused item pointers.\n" "%u pages are entirely empty.\n" --- 824,831 ---- ereport(elevel, (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", RelationGetRelationName(onerel), ! tups_vacuumed, num_tuples, ! vacrelstats->scanned_pages, nblocks), errdetail("%.0f dead row versions cannot be removed yet.\n" "There were %.0f unused item pointers.\n" "%u pages are entirely empty.\n" *************** lazy_cleanup_index(Relation indrel, *** 977,986 **** ivinfo.index = indrel; ivinfo.analyze_only = false; ! ivinfo.estimated_count = !vacrelstats->scanned_all; ivinfo.message_level = elevel; ! /* use rel_tuples only if we scanned all pages, else fall back */ ! ivinfo.num_heap_tuples = vacrelstats->scanned_all ? vacrelstats->rel_tuples : vacrelstats->old_rel_tuples; ivinfo.strategy = vac_strategy; stats = index_vacuum_cleanup(&ivinfo, stats); --- 997,1005 ---- ivinfo.index = indrel; ivinfo.analyze_only = false; ! ivinfo.estimated_count = (vacrelstats->scanned_pages < vacrelstats->rel_pages); ivinfo.message_level = elevel; ! 
ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples; ivinfo.strategy = vac_strategy; stats = index_vacuum_cleanup(&ivinfo, stats); *************** lazy_truncate_heap(Relation onerel, LVRe *** 1041,1048 **** new_rel_pages = RelationGetNumberOfBlocks(onerel); if (new_rel_pages != old_rel_pages) { ! /* might as well use the latest news when we update pg_class stats */ ! vacrelstats->rel_pages = new_rel_pages; UnlockRelation(onerel, AccessExclusiveLock); return; } --- 1060,1072 ---- new_rel_pages = RelationGetNumberOfBlocks(onerel); if (new_rel_pages != old_rel_pages) { ! /* ! * Note: we intentionally don't update vacrelstats->rel_pages with ! * the new rel size here. If we did, it would amount to assuming that ! * the new pages are empty, which is unlikely. Leaving the numbers ! * alone amounts to assuming that the new pages have the same tuple ! * density as existing ones, which is less unlikely. ! */ UnlockRelation(onerel, AccessExclusiveLock); return; } *************** lazy_truncate_heap(Relation onerel, LVRe *** 1076,1082 **** */ UnlockRelation(onerel, AccessExclusiveLock); ! /* update statistics */ vacrelstats->rel_pages = new_rel_pages; vacrelstats->pages_removed = old_rel_pages - new_rel_pages; --- 1100,1110 ---- */ UnlockRelation(onerel, AccessExclusiveLock); ! /* ! * Update statistics. Here, it *is* correct to adjust rel_pages without ! * also touching reltuples, since the tuple count wasn't changed by the ! * truncation. ! */ vacrelstats->rel_pages = new_rel_pages; vacrelstats->pages_removed = old_rel_pages - new_rel_pages; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 5ed6e8337c11fdd599a0deb219e1cb0285228a88..1d80c311d879d9cf9009621860cda3ab19c6dea9 100644 *** a/src/backend/postmaster/pgstat.c --- b/src/backend/postmaster/pgstat.c *************** pgstat_report_autovac(Oid dboid) *** 1246,1253 **** * --------- */ void ! pgstat_report_vacuum(Oid tableoid, bool shared, bool adopt_counts, ! 
PgStat_Counter tuples) { PgStat_MsgVacuum msg; --- 1246,1252 ---- * --------- */ void ! pgstat_report_vacuum(Oid tableoid, bool shared, PgStat_Counter tuples) { PgStat_MsgVacuum msg; *************** pgstat_report_vacuum(Oid tableoid, bool *** 1257,1263 **** pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM); msg.m_databaseid = shared ? InvalidOid : MyDatabaseId; msg.m_tableoid = tableoid; - msg.m_adopt_counts = adopt_counts; msg.m_autovacuum = IsAutoVacuumWorkerProcess(); msg.m_vacuumtime = GetCurrentTimestamp(); msg.m_tuples = tuples; --- 1256,1261 ---- *************** pgstat_report_vacuum(Oid tableoid, bool *** 1271,1277 **** * -------- */ void ! pgstat_report_analyze(Relation rel, bool adopt_counts, PgStat_Counter livetuples, PgStat_Counter deadtuples) { PgStat_MsgAnalyze msg; --- 1269,1275 ---- * -------- */ void ! pgstat_report_analyze(Relation rel, PgStat_Counter livetuples, PgStat_Counter deadtuples) { PgStat_MsgAnalyze msg; *************** pgstat_report_analyze(Relation rel, bool *** 1308,1314 **** pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE); msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId; msg.m_tableoid = RelationGetRelid(rel); - msg.m_adopt_counts = adopt_counts; msg.m_autovacuum = IsAutoVacuumWorkerProcess(); msg.m_analyzetime = GetCurrentTimestamp(); msg.m_live_tuples = livetuples; --- 1306,1311 ---- *************** pgstat_recv_vacuum(PgStat_MsgVacuum *msg *** 4197,4204 **** tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true); ! if (msg->m_adopt_counts) ! tabentry->n_live_tuples = msg->m_tuples; /* Resetting dead_tuples to 0 is an approximation ... */ tabentry->n_dead_tuples = 0; --- 4194,4200 ---- tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true); ! tabentry->n_live_tuples = msg->m_tuples; /* Resetting dead_tuples to 0 is an approximation ... 
*/ tabentry->n_dead_tuples = 0; *************** pgstat_recv_analyze(PgStat_MsgAnalyze *m *** 4233,4243 **** tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true); ! if (msg->m_adopt_counts) ! { ! tabentry->n_live_tuples = msg->m_live_tuples; ! tabentry->n_dead_tuples = msg->m_dead_tuples; ! } /* * We reset changes_since_analyze to zero, forgetting any changes that --- 4229,4236 ---- tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true); ! tabentry->n_live_tuples = msg->m_live_tuples; ! tabentry->n_dead_tuples = msg->m_dead_tuples; /* * We reset changes_since_analyze to zero, forgetting any changes that diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h index 79c9f5d90fb674ca8c778a65ca540ef62bded0af..cfbe0c43924029843f636845d87a09996d706af3 100644 *** a/src/include/commands/vacuum.h --- b/src/include/commands/vacuum.h *************** extern void vacuum(VacuumStmt *vacstmt, *** 142,147 **** --- 142,151 ---- extern void vac_open_indexes(Relation relation, LOCKMODE lockmode, int *nindexes, Relation **Irel); extern void vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode); + extern double vac_estimate_reltuples(Relation relation, bool is_analyze, + BlockNumber total_pages, + BlockNumber scanned_pages, + double scanned_tuples); extern void vac_update_relstats(Relation relation, BlockNumber num_pages, double num_tuples, *************** extern void vacuum_delay_point(void); *** 157,166 **** /* in commands/vacuumlazy.c */ extern void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy, bool *scanned_all); /* in commands/analyze.c */ extern void analyze_rel(Oid relid, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy, bool update_reltuples); #endif /* VACUUM_H */ --- 161,170 ---- /* in commands/vacuumlazy.c */ extern void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, ! 
BufferAccessStrategy bstrategy); /* in commands/analyze.c */ extern void analyze_rel(Oid relid, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy); #endif /* VACUUM_H */ diff --git a/src/include/pgstat.h b/src/include/pgstat.h index f04be95b459b2053c66495979b95f0b697a2af13..5446fa04409ebefc31a3a1f282e7dd1224a5d734 100644 *** a/src/include/pgstat.h --- b/src/include/pgstat.h *************** typedef struct PgStat_MsgVacuum *** 322,328 **** PgStat_MsgHdr m_hdr; Oid m_databaseid; Oid m_tableoid; - bool m_adopt_counts; bool m_autovacuum; TimestampTz m_vacuumtime; PgStat_Counter m_tuples; --- 322,327 ---- *************** typedef struct PgStat_MsgAnalyze *** 339,345 **** PgStat_MsgHdr m_hdr; Oid m_databaseid; Oid m_tableoid; - bool m_adopt_counts; bool m_autovacuum; TimestampTz m_analyzetime; PgStat_Counter m_live_tuples; --- 338,343 ---- *************** extern void pgstat_reset_shared_counters *** 706,714 **** extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type); extern void pgstat_report_autovac(Oid dboid); ! extern void pgstat_report_vacuum(Oid tableoid, bool shared, bool adopt_counts, PgStat_Counter tuples); ! extern void pgstat_report_analyze(Relation rel, bool adopt_counts, PgStat_Counter livetuples, PgStat_Counter deadtuples); extern void pgstat_report_recovery_conflict(int reason); --- 704,712 ---- extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type); extern void pgstat_report_autovac(Oid dboid); ! extern void pgstat_report_vacuum(Oid tableoid, bool shared, PgStat_Counter tuples); ! extern void pgstat_report_analyze(Relation rel, PgStat_Counter livetuples, PgStat_Counter deadtuples); extern void pgstat_report_recovery_conflict(int reason);
On Sat, May 28, 2011 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I also found that Greg was right in thinking that it would help if we
> tweaked lazy_scan_heap to not always scan the first
> SKIP_PAGES_THRESHOLD-1 pages even if they were
> all_visible_according_to_vm. That seemed to skew the results if those
> pages weren't representative. And, for the case of a useless manual
> vacuum on a completely clean table, it would cause the reltuples value
> to drift when there was no reason to change it at all.

You fixed the logic only for the first 32 pages, which helps with the
skew. But really the logic is backwards in general. Instead of counting
how many missed opportunities for skipped pages we've seen in the past,
we should read the bits for the next 32 pages in advance and decide what
to do before we read those pages.

-- 
greg
Greg Stark <gsstark@mit.edu> writes: > On Sat, May 28, 2011 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I also found that Greg was right in thinking that it would help if we >> tweaked lazy_scan_heap to not always scan the first >> SKIP_PAGES_THRESHOLD-1 pages even if they were >> all_visible_according_to_vm. > You fixed the logic only for the first 32 pages which helps with the > skew. But really the logic is backwards in general. Instead of > counting how many missed opportunities for skipped pages we've seen in > the past we should read the bits for the next 32 pages in advance and > decide what to do before we read those pages. OK, do you like the attached version of that logic? (Other fragments of the patch as before.) regards, tom lane diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index 9393fa0727aaad7508e1163623322b4066412257..231447b31223bc5350ce49a136cffafaa53bc5fb 100644 *** a/src/backend/commands/vacuumlazy.c --- b/src/backend/commands/vacuumlazy.c *************** lazy_scan_heap(Relation onerel, LVRelSta *** 311,317 **** int i; PGRUsage ru0; Buffer vmbuffer = InvalidBuffer; ! BlockNumber all_visible_streak; pg_rusage_init(&ru0); --- 305,312 ---- int i; PGRUsage ru0; Buffer vmbuffer = InvalidBuffer; ! BlockNumber next_not_all_visible_block; ! bool skipping_all_visible_blocks; pg_rusage_init(&ru0); *************** lazy_scan_heap(Relation onerel, LVRelSta *** 329,340 **** nblocks = RelationGetNumberOfBlocks(onerel); vacrelstats->rel_pages = nblocks; vacrelstats->nonempty_pages= 0; vacrelstats->latestRemovedXid = InvalidTransactionId; lazy_space_alloc(vacrelstats,nblocks); ! all_visible_streak = 0; for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; --- 324,369 ---- nblocks = RelationGetNumberOfBlocks(onerel); vacrelstats->rel_pages = nblocks; + vacrelstats->scanned_pages = 0; vacrelstats->nonempty_pages = 0; vacrelstats->latestRemovedXid = InvalidTransactionId; lazy_space_alloc(vacrelstats, nblocks); ! /* ! 
* We want to skip pages that don't require vacuuming according to the ! * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD ! * consecutive pages. Since we're reading sequentially, the OS should be ! * doing readahead for us, so there's no gain in skipping a page now and ! * then; that's likely to disable readahead and so be counterproductive. ! * Also, skipping even a single page means that we can't update ! * relfrozenxid, so we only want to do it if we can skip a goodly number ! * of pages. ! * ! * Before entering the main loop, establish the invariant that ! * next_not_all_visible_block is the next block number >= blkno that's ! * not all-visible according to the visibility map, or nblocks if there's ! * no such block. Also, we set up the skipping_all_visible_blocks flag, ! * which is needed because we need hysteresis in the decision: once we've ! * started skipping blocks, we may as well skip everything up to the next ! * not-all-visible block. ! * ! * Note: if scan_all is true, we won't actually skip any pages; but we ! * maintain next_not_all_visible_block anyway, so as to set up the ! * all_visible_according_to_vm flag correctly for each page. ! */ ! for (next_not_all_visible_block = 0; ! next_not_all_visible_block < nblocks; ! next_not_all_visible_block++) ! { ! if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer)) ! break; ! } ! if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD) ! skipping_all_visible_blocks = true; ! else ! skipping_all_visible_blocks = false; ! for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; *************** lazy_scan_heap(Relation onerel, LVRelSta *** 347,387 **** OffsetNumber frozen[MaxOffsetNumber]; int nfrozen; Size freespace; ! bool all_visible_according_to_vm = false; bool all_visible; bool has_dead_tuples; ! /* ! * Skip pages that don't require vacuuming according to the visibility ! * map. But only if we've seen a streak of at least ! 
* SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading ! * sequentially, the OS should be doing readahead for us and there's ! * no gain in skipping a page now and then. You need a longer run of ! * consecutive skipped pages before it's worthwhile. Also, skipping ! * even a single page means that we can't update relfrozenxid or ! * reltuples, so we only want to do it if there's a good chance to ! * skip a goodly number of pages. ! */ ! if (!scan_all) { ! all_visible_according_to_vm = ! visibilitymap_test(onerel, blkno, &vmbuffer); ! if (all_visible_according_to_vm) { ! all_visible_streak++; ! if (all_visible_streak >= SKIP_PAGES_THRESHOLD) ! { ! vacrelstats->scanned_all = false; ! continue; ! } } else ! all_visible_streak = 0; } vacuum_delay_point(); ! scanned_pages++; /* * If we are close to overrunning the available space for dead-tuple --- 376,419 ---- OffsetNumber frozen[MaxOffsetNumber]; int nfrozen; Size freespace; ! bool all_visible_according_to_vm; bool all_visible; bool has_dead_tuples; ! if (blkno == next_not_all_visible_block) { ! /* Time to advance next_not_all_visible_block */ ! for (next_not_all_visible_block++; ! next_not_all_visible_block < nblocks; ! next_not_all_visible_block++) { ! if (!visibilitymap_test(onerel, next_not_all_visible_block, ! &vmbuffer)) ! break; } + + /* + * We know we can't skip the current block. But set up + * skipping_all_visible_blocks to do the right thing at the + * following blocks. + */ + if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD) + skipping_all_visible_blocks = true; else ! skipping_all_visible_blocks = false; ! all_visible_according_to_vm = false; ! } ! else ! { ! /* Current block is all-visible */ ! if (skipping_all_visible_blocks && !scan_all) ! continue; ! all_visible_according_to_vm = true; } vacuum_delay_point(); ! vacrelstats->scanned_pages++; /* * If we are close to overrunning the available space for dead-tuple
From: Cédric Villemain
2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
> Greg Stark <gsstark@mit.edu> writes:
>> On Sat, May 28, 2011 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I also found that Greg was right in thinking that it would help if we
>>> tweaked lazy_scan_heap to not always scan the first
>>> SKIP_PAGES_THRESHOLD-1 pages even if they were
>>> all_visible_according_to_vm.
>
>> You fixed the logic only for the first 32 pages which helps with the
>> skew. But really the logic is backwards in general. Instead of
>> counting how many missed opportunities for skipped pages we've seen in
>> the past we should read the bits for the next 32 pages in advance and
>> decide what to do before we read those pages.
>
> OK, do you like the attached version of that logic?  (Other fragments
> of the patch as before.)

The idea was that removing even one page from the VACUUM scan prevents
the relfrozenxid update and the reltuples (and relpages) update. Now, I
believe that once we've skipped at least one page thanks to
SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as many
pages as possible from the VACUUM, thanks to the VM.

>                        regards, tom lane
>
> diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
> index 9393fa0727aaad7508e1163623322b4066412257..231447b31223bc5350ce49a136cffafaa53bc5fb 100644
> *** a/src/backend/commands/vacuumlazy.c
> --- b/src/backend/commands/vacuumlazy.c
> *************** lazy_scan_heap(Relation onerel, LVRelSta
> *** 311,317 ****
>   	int			i;
>   	PGRUsage	ru0;
>   	Buffer		vmbuffer = InvalidBuffer;
> ! 	BlockNumber	all_visible_streak;
>
>   	pg_rusage_init(&ru0);
>
> --- 305,312 ----
>   	int			i;
>   	PGRUsage	ru0;
>   	Buffer		vmbuffer = InvalidBuffer;
> ! 	BlockNumber	next_not_all_visible_block;
> ! 	bool		skipping_all_visible_blocks;
>
>   	pg_rusage_init(&ru0);
>
> *************** lazy_scan_heap(Relation onerel, LVRelSta
> *** 329,340 ****
>
>   	nblocks = RelationGetNumberOfBlocks(onerel);
>   	vacrelstats->rel_pages = nblocks;
>   	vacrelstats->nonempty_pages = 0;
>   	vacrelstats->latestRemovedXid = InvalidTransactionId;
>
>   	lazy_space_alloc(vacrelstats, nblocks);
>
> ! 	all_visible_streak = 0;
>   	for (blkno = 0; blkno < nblocks; blkno++)
>   	{
>   		Buffer		buf;
> --- 324,369 ----
>
>   	nblocks = RelationGetNumberOfBlocks(onerel);
>   	vacrelstats->rel_pages = nblocks;
> + 	vacrelstats->scanned_pages = 0;
>   	vacrelstats->nonempty_pages = 0;
>   	vacrelstats->latestRemovedXid = InvalidTransactionId;
>
>   	lazy_space_alloc(vacrelstats, nblocks);
>
> ! 	/*
> ! 	 * We want to skip pages that don't require vacuuming according to the
> ! 	 * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD
> ! 	 * consecutive pages.  Since we're reading sequentially, the OS should be
> ! 	 * doing readahead for us, so there's no gain in skipping a page now and
> ! 	 * then; that's likely to disable readahead and so be counterproductive.
> ! 	 * Also, skipping even a single page means that we can't update
> ! 	 * relfrozenxid, so we only want to do it if we can skip a goodly number
> ! 	 * of pages.
> ! 	 *
> ! 	 * Before entering the main loop, establish the invariant that
> ! 	 * next_not_all_visible_block is the next block number >= blkno that's
> ! 	 * not all-visible according to the visibility map, or nblocks if there's
> ! 	 * no such block.  Also, we set up the skipping_all_visible_blocks flag,
> ! 	 * which is needed because we need hysteresis in the decision: once we've
> ! 	 * started skipping blocks, we may as well skip everything up to the next
> ! 	 * not-all-visible block.
> ! 	 *
> ! 	 * Note: if scan_all is true, we won't actually skip any pages; but we
> ! 	 * maintain next_not_all_visible_block anyway, so as to set up the
> ! 	 * all_visible_according_to_vm flag correctly for each page.
> ! 	 */
> ! 	for (next_not_all_visible_block = 0;
> ! 		 next_not_all_visible_block < nblocks;
> ! 		 next_not_all_visible_block++)
> ! 	{
> ! 		if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
> ! 			break;
> ! 	}
> ! 	if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD)
> ! 		skipping_all_visible_blocks = true;
> ! 	else
> ! 		skipping_all_visible_blocks = false;
> !
>   	for (blkno = 0; blkno < nblocks; blkno++)
>   	{
>   		Buffer		buf;
> *************** lazy_scan_heap(Relation onerel, LVRelSta
> *** 347,387 ****
>   		OffsetNumber frozen[MaxOffsetNumber];
>   		int			nfrozen;
>   		Size		freespace;
> ! 		bool		all_visible_according_to_vm = false;
>   		bool		all_visible;
>   		bool		has_dead_tuples;
>
> ! 		/*
> ! 		 * Skip pages that don't require vacuuming according to the visibility
> ! 		 * map. But only if we've seen a streak of at least
> ! 		 * SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading
> ! 		 * sequentially, the OS should be doing readahead for us and there's
> ! 		 * no gain in skipping a page now and then. You need a longer run of
> ! 		 * consecutive skipped pages before it's worthwhile. Also, skipping
> ! 		 * even a single page means that we can't update relfrozenxid or
> ! 		 * reltuples, so we only want to do it if there's a good chance to
> ! 		 * skip a goodly number of pages.
> ! 		 */
> ! 		if (!scan_all)
>   		{
> ! 			all_visible_according_to_vm =
> ! 				visibilitymap_test(onerel, blkno, &vmbuffer);
> ! 			if (all_visible_according_to_vm)
>   			{
> ! 				all_visible_streak++;
> ! 				if (all_visible_streak >= SKIP_PAGES_THRESHOLD)
> ! 				{
> ! 					vacrelstats->scanned_all = false;
> ! 					continue;
> ! 				}
>   			}
>   			else
> ! 				all_visible_streak = 0;
>   		}
>
>   		vacuum_delay_point();
>
> ! 		scanned_pages++;
>
>   		/*
>   		 * If we are close to overrunning the available space for dead-tuple
> --- 376,419 ----
>   		OffsetNumber frozen[MaxOffsetNumber];
>   		int			nfrozen;
>   		Size		freespace;
> ! 		bool		all_visible_according_to_vm;
>   		bool		all_visible;
>   		bool		has_dead_tuples;
>
> ! 		if (blkno == next_not_all_visible_block)
>   		{
> ! 			/* Time to advance next_not_all_visible_block */
> ! 			for (next_not_all_visible_block++;
> ! 				 next_not_all_visible_block < nblocks;
> ! 				 next_not_all_visible_block++)
>   			{
> ! 				if (!visibilitymap_test(onerel, next_not_all_visible_block,
> ! 										&vmbuffer))
> ! 					break;
>   			}
> +
> + 			/*
> + 			 * We know we can't skip the current block. But set up
> + 			 * skipping_all_visible_blocks to do the right thing at the
> + 			 * following blocks.
> + 			 */
> + 			if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD)
> + 				skipping_all_visible_blocks = true;
>   			else
> ! 				skipping_all_visible_blocks = false;
> ! 			all_visible_according_to_vm = false;
> ! 		}
> ! 		else
> ! 		{
> ! 			/* Current block is all-visible */
> ! 			if (skipping_all_visible_blocks && !scan_all)
> ! 				continue;
> ! 			all_visible_according_to_vm = true;
>   		}
>
>   		vacuum_delay_point();
>
> ! 		vacrelstats->scanned_pages++;
>
>   		/*
>   		 * If we are close to overrunning the available space for dead-tuple
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
> 2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
>> OK, do you like the attached version of that logic?  (Other fragments
>> of the patch as before.)

> The idea was that removing even one page from the VACUUM scan prevents
> the relfrozenxid update and the reltuples (and relpages) update.
> Now, I believe that once we've skipped at least one page thanks to
> SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as many
> pages as possible from the VACUUM, thanks to the VM.

That would require proof, not just suggestion. Skipping pages will
defeat the OS read-ahead algorithm, and so could easily cost more than
reading them.

			regards, tom lane
On Sun, May 29, 2011 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
>> 2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
>>> OK, do you like the attached version of that logic? (Other fragments
>>> of the patch as before.)
>
>> The idea was that removing even one page from the VACUUM scan prevents
>> the relfrozenxid update and the reltuples (and relpages) update.
>> Now, I believe that once we've skipped at least one page thanks to
>> SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as many
>> pages as possible from the VACUUM, thanks to the VM.
>
> That would require proof, not just suggestion. Skipping pages will
> defeat the OS read-ahead algorithm, and so could easily cost more than
> reading them.

My worry is that what we have right now is also based on assumptions
and gut feeling rather than any numbers.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> On Sun, May 29, 2011 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That would require proof, not just suggestion.  Skipping pages will
>> defeat the OS read-ahead algorithm, and so could easily cost more than
>> reading them.

> My worry is that what we have right now is also based on assumptions
> and gut feeling rather than any numbers.

So go collect some numbers.

			regards, tom lane
On Sun, May 29, 2011 at 9:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pavan Deolasee <pavan.deolasee@gmail.com> writes:
>> On Sun, May 29, 2011 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> That would require proof, not just suggestion. Skipping pages will
>>> defeat the OS read-ahead algorithm, and so could easily cost more than
>>> reading them.
>
>> My worry is that what we have right now is also based on assumptions
>> and gut feeling rather than any numbers.
>
> So go collect some numbers.

I am sorry if I sounded terse above. But my gripe is that sometimes we
are too reluctant to listen to ideas and insist on producing hard
numbers first, which might take significant effort. But we are not
equally strict when such changes are introduced initially. For example,
in this particular case, the change was introduced after this
discussion:

http://archives.postgresql.org/pgsql-hackers/2008-12/msg01316.php

Heikki suggested 20; Simon proposed 32 to make it a power of 2. But why
not 16? 20 is closer to 16 than to 32. And Greg yesterday said that 8
is a better number based on his testing.

Maybe a performance build farm, as is being discussed, is the solution:
we could throw simple patches at it and see whether something helps or
not.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> I am sorry if I sounded terse above. But my gripe is that sometimes we
> are too reluctant to listen to ideas and insist on producing hard
> numbers first, which might take significant effort. But we are not
> equally strict when such changes are introduced initially.

The reason for not wanting to change it without some actual evidence is
that there is already evidence: the code has been in the field with this
setting since 8.4, and nobody's vacuum performance has fallen off a
cliff. So while I'd agree that there was little testing done before the
code went in, there is more than zero reason to leave it where it is.
Without some positive evidence showing that another value is better, I'm
disinclined to change it. I also think that you're not helping by
complaining about the code without being willing to do some work to try
to collect such evidence.

> Heikki suggested 20; Simon proposed 32 to make it a power of 2. But why
> not 16? 20 is closer to 16 than to 32. And Greg yesterday said that 8
> is a better number based on his testing.

Greg said he had found that the read speed was the same for reading
every page vs reading every 8th page. That's not the same as concluding
that 8 is the optimal skip distance for vacuum; or at least, he didn't
say that's what he had concluded. vacuum isn't just reading ...

			regards, tom lane
On Sun, May 29, 2011 at 10:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pavan Deolasee <pavan.deolasee@gmail.com> writes:
>> I am sorry if I sounded terse above. But my gripe is that sometimes we
>> are too reluctant to listen to ideas and insist on producing hard
>> numbers first, which might take significant effort. But we are not
>> equally strict when such changes are introduced initially.
>
> The reason for not wanting to change it without some actual evidence
> is that there is already evidence: the code has been in the field with
> this setting since 8.4, and nobody's vacuum performance has fallen off
> a cliff.

Well, that's probably because there was definitely much improvement over
what existed before. But that does not mean we can't make it better.
IOW, there are no complaints because there is no regression.

> So while I'd agree that there was little testing done before the code
> went in, there is more than zero reason to leave it where it is.
> Without some positive evidence showing that another value is better,
> I'm disinclined to change it. I also think that you're not helping by
> complaining about the code without being willing to do some work to
> try to collect such evidence.

I am not complaining about the code. I am suggesting we can be more
receptive to ideas, especially when we know that what we have today was
not backed by any evidence either. I will do some tests anyway and post
numbers when I work on the single-pass vacuum patch; I'll try to
experiment with this stuff at that time.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
From: Cédric Villemain
2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
> Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
>> 2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
>>> OK, do you like the attached version of that logic? (Other fragments
>>> of the patch as before.)
>
>> The idea was that removing even one page from the VACUUM scan prevents
>> the relfrozenxid update and the reltuples (and relpages) update.
>> Now, I believe that once we've skipped at least one page thanks to
>> SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as many
>> pages as possible from the VACUUM, thanks to the VM.
>
> That would require proof, not just suggestion. Skipping pages will
> defeat the OS read-ahead algorithm, and so could easily cost more than
> reading them.

Correct, it needs proof.

As an aside: I have also learned that reading the first block of a file
makes read-ahead start out with its larger window (the same one that
POSIX_FADV_SEQUENTIAL sets), so removing those initial reads might be
counterproductive as well. But this is damn hard to benchmark, because
read-ahead is also influenced by memory pressure, for example.

In theory: 1. the readahead algorithm is a bit smarter than that and can
cope with reads-with-holes (as long as the holes are not larger than the
read-ahead window), and 2. if the holes are that large, then it may not
be so good to keep a large read-ahead window anyway (which keeps
thrashing our buffer cache).

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support