Thread: Stopgap solution for table-size-estimate updating problem
There's been some previous discussion of getting rid of the pg_class columns relpages and reltuples, in favor of having the planner check the current relation block count directly (RelationGetNumberOfBlocks) and extrapolate the current tuple count based on the most recently measured tuples-per-page density. A couple of past threads are

http://archives.postgresql.org/pgsql-performance/2004-10/msg00367.php
http://archives.postgresql.org/pgsql-general/2004-08/msg01422.php

and the point came up again today:

http://archives.postgresql.org/pgsql-performance/2004-11/msg00401.php

where we were again reminded of the problems incurred by obsolete estimates.

It occurs to me that we could get most of the bang for the buck without making any incompatible changes: just leave the existing fields in place but make the planner use reltuples-divided-by-relpages as the density estimate. That is, in place of where we have

	rel->pages = relation->rd_rel->relpages;
	rel->tuples = relation->rd_rel->reltuples;

in plancat.c, just do

	rel->pages = RelationGetNumberOfBlocks(relation);
	if (relation->rd_rel->relpages > 0)
		density = relation->rd_rel->reltuples / relation->rd_rel->relpages;
	else
		density = some_default_estimate;
	rel->tuples = round(rel->pages * density);

In addition to this we'd perhaps want to hack VACUUM so that when the table is empty, it doesn't simply zero out relpages/reltuples, but somehow preserves the previous density value so we don't have to fall back to the default density estimate. (This of course assumes that we will refill the table with a density roughly similar to the last measured density; which might be wrong, but it's still better than just using a default, I think.) One way to do that is to set relpages = zero (truthfully) but set reltuples to the previously estimated density (we can do this because it's already a float field). It might look a little funny to have nonzero reltuples when relpages is zero, but I think it wouldn't break anything.
Then the above logic becomes

	rel->pages = RelationGetNumberOfBlocks(relation);
	if (relation->rd_rel->relpages > 0)
		density = relation->rd_rel->reltuples / relation->rd_rel->relpages;
	else if (relation->rd_rel->reltuples > 0)	/* already a density */
		density = relation->rd_rel->reltuples;
	else
		density = some_default_estimate;
	rel->tuples = round(rel->pages * density);

A variant of this is to set reltuples = density, relpages = 1 instead of 0, which makes the relpages value a lie but would be even less likely to confuse client-side code.

Comments? Does this seem like something reasonable to do for 8.0?

			regards, tom lane
Re: Stopgap solution for table-size-estimate updating problem
From: "Zeugswetter Andreas DAZ SD"
> rel->pages = RelationGetNumberOfBlocks(relation);

Is RelationGetNumberOfBlocks cheap enough that you can easily use it for the optimizer?

I myself have always preferred more stable estimates that only change when told to. I never liked that vacuum (without analyze) and create index change those values; imho only analyze should. But I am used to applications that prepare a query and hold the plan for days or weeks. If you happen to create the plan when the table is by chance empty you lost. But it seems I am quite alone with that opinion.

> A variant of this is to set reltuples = density, relpages = 1 instead
> of 0, which makes the relpages value a lie but would be even less likely
> to confuse client-side code.

I would let analyze still fill in the correct values. The previously measured density (with relpages stored as 0 or 1) would be used only when the table is empty. I don't think either would confuse client-side code more or less. More confusing for client-side code would be if the two would not show the analyze results any more.

I would like to see this configurable, like "use_analyzed_stats" or "enable_online_statistics" or some such that can be reused for future online statistics. One setting would use:

	pages = RelationGetNumberOfBlocks(relation)
	count = pages * reltuples / relpages

the other:

	pages = relpages
	count = reltuples

Andreas
"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:
>> rel->pages = RelationGetNumberOfBlocks(relation);

> Is RelationGetNumberOfBlocks cheap enough that you can easily use it
> for the optimizer ?

It's basically going to cost one extra lseek() kernel call ... per query, per table referenced in the query. I no longer think this is a killer argument. lseek isn't going to induce any I/O, so it should be cheap as kernel calls go.

> I myself have always preferred more stable estimates that only change
> when told to. I never liked that vacuum (without analyze) and create index
> change those values, imho only analyze should.

Indeed, this is probably the most significant argument in favor of leaving things as they are. But the other side of the coin is: why shouldn't a bunch of inserts or updates cause the plan to change? The people who are complaining about it have certainly posted plenty of examples where it would help to change the plans.

> But I am used to applications that prepare a query and hold the plan
> for days or weeks. If you happen to create the plan when the table is
> by chance empty you lost.

You lose in either case, since this proposal doesn't change when planning occurs or doesn't occur.

			regards, tom lane
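To illustrate why this is cheap: finding a file's size via lseek(SEEK_END) issues no reads at all. A minimal sketch of the idea, with a plain file standing in for a relation segment and BLCKSZ hard-coded to PostgreSQL's default 8192 (the function name is invented; mdnblocks() is the real implementation):

```c
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

/*
 * Roughly what the block-count lookup boils down to for a single
 * relation segment: one lseek(SEEK_END), no reads or writes.
 */
static long
file_block_count(int fd)
{
	off_t		size = lseek(fd, 0, SEEK_END);

	if (size < 0)
		return -1;				/* lseek failed */
	return (long) (size / BLCKSZ);
}
```

Since lseek only consults in-kernel file metadata, the per-query cost is one cheap syscall per referenced table, as noted above.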
On Sat, 2004-11-27 at 00:54, Tom Lane wrote:
> "Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:
> >> rel->pages = RelationGetNumberOfBlocks(relation);
>
> > Is RelationGetNumberOfBlocks cheap enough that you can easily use it
> > for the optimizer ?
>
> It's basically going to cost one extra lseek() kernel call ... per
> query, per table referenced in the query. I no longer think this is
> a killer argument. lseek isn't going to induce any I/O, so it should
> be cheap as kernel calls go.

OK, didn't believe it at first, but it is just one call, since md.c's mdnblocks() is already optimized to avoid O(N) behaviour in terms of kernel calls. Cool.

> > I myself have always preferred more stable estimates that only change
> > when told to. I never liked that vacuum (without analyze) and create index
> > change those values, imho only analyze should.
>
> Indeed, this is probably the most significant argument in favor of
> leaving things as they are. But the other side of the coin is: why
> shouldn't a bunch of inserts or updates cause the plan to change?
> The people who are complaining about it have certainly posted plenty
> of examples where it would help to change the plans.

ISTM that you both have good arguments. Andreas' argument is what most people expect to happen if they come from other commercial RDBMS. This was my starting place, though upon reflection I think Tom's proposal seems to be the way of the future, even if it does now seem a more dynamic approach with less direct control for the DBA.

If we look at this from the perspective of how many people will post problems about this issue, I'd say this late in the beta cycle there's a big risk that it will cause much grief. However, the issue already does cause much grief to those who don't manage it well (as Richard Huxton points out on a previous thread today).
There doesn't seem to be any benefit in staying where we are today, apart from for those few who already precisely and accurately control statistics collection.

Andreas called for a GUC to control that behaviour. Given that the more dynamic behaviour suggested needs to be turned on by default, it does seem reasonable to have a GUC that allows you to turn it off. There may be other side effects discovered later that require more manual control, and it would make sense at that point to have a switch to turn it off if not required. So, I vote in favour of the new dynamic estimation method being added to 8.0, on by default, but with a GUC to turn it off if problems arise:

	enable_dynamic_statistics=true

If it holds good, like I'm sure it will, then this can be deprecated later. Many other aspects of statistics collection can occur dynamically also, such as post-execution cardinality statistics. Or perhaps some_default_estimate could itself be a GUC that turns the feature off when set to 0...? If not, what value is proposed?

On the topic of accuracy of the estimate: Updates cause additional data to be written to the table, so tables get bigger until vacuumed. Tables with many Inserts are also regularly trimmed with Deletes. With a relatively static workload and a regular vacuum cycle, the table size for many major tables eventually levels off, remaining roughly constant, but the number of non-zero pages will vary over time in a saw-tooth curve. Estimating the cardinality by using the number of blocks would ignore the fact that many of them are empty for much of the time. That would then lead to a systematic over-estimate of the cardinality of the regularly updated tables. You have to take the estimate from somewhere, but I note that the current practice of using a VACUUM ANALYZE would mean that the statistics would be collected when free space in the table was highest.
That estimate would differ from the dynamic method suggested, since this would lead to a calculation equivalent to taking an ANALYZE immediately before a VACUUM, rather than after it.

How easy would it be to take into account the length of the FSM for the relation also?

[...IIRC DB2 has a VOLATILE option at table level, which enables dynamic estimation of statistics.]

-- 
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On the topic of accuracy of the estimate: Updates cause additional data
> to be written to the table, so tables get bigger until vacuumed. Tables
> with many Inserts are also regularly trimmed with Deletes. With a
> relatively static workload and a regular vacuum cycle, the table size
> for many major tables eventually levels off, remaining roughly constant
> but the number of non-zero pages will vary over time in a saw-tooth
> curve. Estimating the cardinality by using the number of blocks would
> ignore the fact that many of them are empty for much of the time. That
> would then lead to a systematic over-estimate of the cardinality of the
> regularly updated tables.

You mean underestimate. After a VACUUM, the tuples-per-page figure would be set to a relatively low value, and then subsequent inserts would fill in the free space, causing the actual density to rise while the physical number of blocks stays more or less constant. So the proposed method would always give an accurate number of blocks, but it would tend to underestimate the number of tuples in a dynamic situation.

Still, it's better than the current method, which is likely to underestimate both parameters. I believe that having an accurate block count and an underestimated tuple count would tend to favor choosing indexscans over seqscans, which is probably a good thing --- when was the last time you saw someone complaining that the planner had improperly chosen an indexscan over a seqscan?

> How easy would it be to take into account the length of the FSM for the
> relation also?

Don't think this would help; the FSM doesn't really track number of tuples. Free space isn't a good guide to number of tuples because you can't distinguish inserts from updates at that level. (I'm also a bit concerned about turning the FSM into a source of contention.)

			regards, tom lane
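To put invented numbers on the direction of that bias: suppose VACUUM recorded 100 pages and 5000 live tuples, and inserts have since refilled the freed space to 8000 tuples without growing the file. The extrapolation still reports 5000 — the block count is exact, the tuple count low:

```c
/*
 * Extrapolated tuple count: current block count times the
 * tuples-per-page density measured at the last VACUUM.
 */
static double
extrapolated_tuples(double current_pages,
					double vac_reltuples, double vac_relpages)
{
	return current_pages * (vac_reltuples / vac_relpages);
}
```

With the numbers above the estimate comes out at 5000 against an actual 8000 — an underestimate, never an overestimate, so long as the file does not shrink.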
On Sun, 2004-11-28 at 18:52, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On the topic of accuracy of the estimate: Updates cause additional data
> > to be written to the table, so tables get bigger until vacuumed. Tables
> > with many Inserts are also regularly trimmed with Deletes. With a
> > relatively static workload and a regular vacuum cycle, the table size
> > for many major tables eventually levels off, remaining roughly constant
> > but the number of non-zero pages will vary over time in a saw-tooth
> > curve. Estimating the cardinality by using the number of blocks would
> > ignore the fact that many of them are empty for much of the time. That
> > would then lead to a systematic over-estimate of the cardinality of the
> > regularly updated tables.
>
> You mean underestimate. After a VACUUM, the tuples-per-page figure
> would be set to a relatively low value, and then subsequent inserts
> would fill in the free space, causing the actual density to rise
> while the physical number of blocks stays more or less constant.
> So the proposed method would always give an accurate number of blocks,
> but it would tend to underestimate the number of tuples in a dynamic
> situation.

OK, just *wrong* then (the estimate, as well as myself) :-)

> Still, it's better than the current method, which is likely to
> underestimate both parameters. I believe that having an accurate block
> count and an underestimated tuple count would tend to favor choosing
> indexscans over seqscans, which is probably a good thing --- when was
> the last time you saw someone complaining that the planner had
> improperly chosen an indexscan over a seqscan?

Well, yes, but user perception is not always right. Given we expect an underestimate, can we put in a correction factor should the estimate get really low? It sounds like we could end up choosing nested joins more often when we should have chosen merge joins.
That is something that people do regularly complain about (indirectly).

> > How easy would it be to take into account the length of the FSM for the
> > relation also?
>
> Don't think this would help; the FSM doesn't really track number of
> tuples. Free space isn't a good guide to number of tuples because you
> can't distinguish inserts from updates at that level. (I'm also a bit
> concerned about turning the FSM into a source of contention.)

Agreed.

-- 
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Given we expect an underestimate, can we put in a correction factor
> should the estimate get really low...sounds like we could end up
> choosing nested joins more often when we should have chosen merge joins.

One possibility: vacuum already knows how many tuples it removed. We could set reltuples equal to, say, the mean of the number-of-tuples-after-vacuuming and the number-of-tuples-before. In a steady state situation this would represent a fairly reasonable choice. In cases where the table size has actually decreased permanently, it'd take a few cycles of vacuuming before reltuples converges to the new value, but that doesn't seem too bad.

A standalone ANALYZE should still do what it does now, though, I think; namely set reltuples to its best estimate of the current value.

			regards, tom lane
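A sketch of that averaging rule (the helper name is invented; in the proposal VACUUM would apply this internally when it updates pg_class):

```c
/*
 * Invented helper illustrating the proposed rule: after each VACUUM,
 * store the mean of the tuple count before and after vacuuming.
 */
static double
next_reltuples(double tuples_before_vacuum, double tuples_after_vacuum)
{
	return (tuples_before_vacuum + tuples_after_vacuum) / 2.0;
}
```

For example, if a table shrinks permanently from 10000 to 1000 tuples, the first vacuum stores (10000 + 1000) / 2 = 5500; successive vacuums then pull the stored value toward the new steady state, and in a steady state (before == after) it is simply the true count.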
On Sun, 2004-11-28 at 22:35, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Given we expect an underestimate, can we put in a correction factor
> > should the estimate get really low...sounds like we could end up
> > choosing nested joins more often when we should have chosen merge joins.
>
> One possibility: vacuum already knows how many tuples it removed. We
> could set reltuples equal to, say, the mean of the number-of-tuples-
> after-vacuuming and the number-of-tuples-before. In a steady state
> situation this would represent a fairly reasonable choice. In cases
> where the table size has actually decreased permanently, it'd take a few
> cycles of vacuuming before reltuples converges to the new value, but that
> doesn't seem too bad.

That sounds good to me. Covers all cases I can see from here.

> A standalone ANALYZE should still do what it does now, though, I think;
> namely set reltuples to its best estimate of the current value.

A GUC-free solution...but yet manual control is possible. Sounds good to me - and for you Andreas, also?

-- 
Best Regards, Simon Riggs