Thread: Stopgap solution for table-size-estimate updating problem
There's been some previous discussion of getting rid of the pg_class columns relpages and reltuples, in favor of having the planner check the current relation block count directly (RelationGetNumberOfBlocks) and extrapolate the current tuple count based on the most recently measured tuples-per-page density. A couple of past threads are

http://archives.postgresql.org/pgsql-performance/2004-10/msg00367.php
http://archives.postgresql.org/pgsql-general/2004-08/msg01422.php

and the point came up again today:

http://archives.postgresql.org/pgsql-performance/2004-11/msg00401.php

where we were again reminded of the problems incurred by obsolete estimates.

It occurs to me that we could get most of the bang for the buck without making any incompatible changes: just leave the existing fields in place but make the planner use reltuples-divided-by-relpages as the density estimate. That is, in place of where we have

	rel->pages = relation->rd_rel->relpages;
	rel->tuples = relation->rd_rel->reltuples;

in plancat.c, just do

	rel->pages = RelationGetNumberOfBlocks(relation);
	if (relation->rd_rel->relpages > 0)
		density = relation->rd_rel->reltuples / relation->rd_rel->relpages;
	else
		density = some_default_estimate;
	rel->tuples = round(rel->pages * density);

In addition to this we'd perhaps want to hack VACUUM so that when the table is empty, it doesn't simply zero out relpages/reltuples, but somehow preserves the previous density value so we don't have to fall back to the default density estimate. (This of course assumes that we will refill the table with a density roughly similar to the last measured density; which might be wrong, but it's still better than just using a default, I think.) One way to do that is to set relpages = zero (truthfully) but set reltuples to the previously estimated density (we can do this because it's already a float field). It might look a little funny to have nonzero reltuples when relpages is zero, but I think it wouldn't break anything.
Then the above logic becomes

	rel->pages = RelationGetNumberOfBlocks(relation);
	if (relation->rd_rel->relpages > 0)
		density = relation->rd_rel->reltuples / relation->rd_rel->relpages;
	else if (relation->rd_rel->reltuples > 0)	/* already a density */
		density = relation->rd_rel->reltuples;
	else
		density = some_default_estimate;
	rel->tuples = round(rel->pages * density);

A variant of this is to set reltuples = density, relpages = 1 instead of 0, which makes the relpages value a lie but would be even less likely to confuse client-side code.

Comments? Does this seem like something reasonable to do for 8.0?

			regards, tom lane
Re: Stopgap solution for table-size-estimate updating problem
From: "Zeugswetter Andreas DAZ SD"
> rel->pages = RelationGetNumberOfBlocks(relation);

Is RelationGetNumberOfBlocks cheap enough that you can easily use it for the optimizer?

I myself have always preferred more stable estimates that only change when told to. I never liked that vacuum (without analyze) and create index change those values; imho only analyze should. But I am used to applications that prepare a query and hold the plan for days or weeks. If you happen to create the plan when the table is by chance empty you lost. But it seems I am quite alone with that opinion.

> A variant of this is to set reltuples = density, relpages = 1 instead
> of 0, which makes the relpages value a lie but would be even less likely
> to confuse client-side code.

I would let analyze still fill in the correct values. The previously measured density (with relpages stored as 0 or 1) would be used only when the table is empty. I don't think either would confuse client-side code more or less. More confusing for client-side code would be if the two would not show the analyze results any more.

I would like to see this configurable, like "use_analyzed_stats" or "enable_online_statistics" or some such that can be reused for future online statistics. One setting would use:

	pages = RelationGetNumberOfBlocks(relation)
	count = pages * reltuples / relpages

the other:

	pages = relpages
	count = reltuples

Andreas
"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:
>> rel->pages = RelationGetNumberOfBlocks(relation);

> Is RelationGetNumberOfBlocks cheap enough that you can easily use it
> for the optimizer ?

It's basically going to cost one extra lseek() kernel call ... per query, per table referenced in the query. I no longer think this is a killer argument. lseek isn't going to induce any I/O, so it should be cheap as kernel calls go.

> I myself have always preferred more stable estimates that only change
> when told to. I never liked that vacuum (without analyze) and create index
> change those values, imho only analyze should.

Indeed, this is probably the most significant argument in favor of leaving things as they are. But the other side of the coin is: why shouldn't a bunch of inserts or updates cause the plan to change? The people who are complaining about it have certainly posted plenty of examples where it would help to change the plans.

> But I am used to applications that prepare a query and hold the plan
> for days or weeks. If you happen to create the plan when the table is
> by chance empty you lost.

You lose in either case, since this proposal doesn't change when planning occurs or doesn't occur.

			regards, tom lane
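To illustrate why this is cheap: finding a file's size via lseek(SEEK_END) issues no reads at all. A minimal sketch of the idea, with a plain file standing in for a relation segment and BLCKSZ hard-coded to PostgreSQL's default 8192 (the function name is invented; mdnblocks() is the real implementation):

```c
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

/*
 * Roughly what the block-count lookup boils down to for a single
 * relation segment: one lseek(SEEK_END), no reads or writes.
 */
static long
file_block_count(int fd)
{
	off_t		size = lseek(fd, 0, SEEK_END);

	if (size < 0)
		return -1;				/* lseek failed */
	return (long) (size / BLCKSZ);
}
```

Since lseek only consults in-kernel file metadata, the per-query cost is one cheap syscall per referenced table, as noted above.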
On Sat, 2004-11-27 at 00:54, Tom Lane wrote:
> "Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:
> >> rel->pages = RelationGetNumberOfBlocks(relation);
>
> > Is RelationGetNumberOfBlocks cheap enough that you can easily use it
> > for the optimizer ?
>
> It's basically going to cost one extra lseek() kernel call ... per
> query, per table referenced in the query. I no longer think this is
> a killer argument. lseek isn't going to induce any I/O, so it should
> be cheap as kernel calls go.

OK, didn't believe it at first, but it is just one call, since md.c's mdnblocks() is already optimized to avoid O(N) behaviour in terms of kernel calls. Cool.

> > I myself have always preferred more stable estimates that only change
> > when told to. I never liked that vacuum (without analyze) and create index
> > change those values, imho only analyze should.
>
> Indeed, this is probably the most significant argument in favor of
> leaving things as they are. But the other side of the coin is: why
> shouldn't a bunch of inserts or updates cause the plan to change?
> The people who are complaining about it have certainly posted plenty
> of examples where it would help to change the plans.

ISTM that you both have good arguments. Andreas' argument is what most people expect to happen if they come from other commercial RDBMS. This was my starting place, though upon reflection I think Tom's proposal seems to be the way of the future, even if it does now seem a more dynamic approach with less direct control for the DBA.

If we look at this from the perspective of how many people will post problems about this issue, I'd say this late in the beta cycle there's a big risk that it will cause much grief. However, the issue already does cause much grief to those who don't manage it well (as Richard Huxton points out on a previous thread today).
There doesn't seem to be any benefit in staying where we are today, apart from for those few who already precisely and accurately control statistics collection.

Andreas called for a GUC to control that behaviour. Given that the more dynamic behaviour suggested needs to be turned on by default, it does seem reasonable to have a GUC that allows you to turn it off. There may be other side effects discovered later that require more manual control, and it would make sense at that point to have a switch to turn it off if not required. So, I vote in favour of the new dynamic estimation method being added to 8.0, on by default, but with a GUC to turn it off if problems arise:

	enable_dynamic_statistics=true

If it holds good, like I'm sure it will, then this can be deprecated later. Many other aspects of statistics collection can occur dynamically also, such as post-execution cardinality statistics. Or perhaps some_default_estimate could itself be a GUC that turns the feature off when set to 0...? If not, what value is proposed?

On the topic of accuracy of the estimate: Updates cause additional data to be written to the table, so tables get bigger until vacuumed. Tables with many Inserts are also regularly trimmed with Deletes. With a relatively static workload and a regular vacuum cycle, the table size for many major tables eventually levels off, remaining roughly constant, but the number of non-zero pages will vary over time in a saw-tooth curve. Estimating the cardinality by using the number of blocks would ignore the fact that many of them are empty for much of the time. That would then lead to a systematic over-estimate of the cardinality of the regularly updated tables. You have to take the estimate from somewhere, but I note that the current practice of using a VACUUM ANALYZE would mean that the statistics would be collected when free space in the table was highest.
That estimate would differ from the dynamic method suggested, since this would lead to a calculation equivalent to taking an ANALYZE immediately before a VACUUM, rather than after it.

How easy would it be to take into account the length of the FSM for the relation also?

[...IIRC DB2 has a VOLATILE option at table level, which enables dynamic estimation of statistics.]

-- 
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On the topic of accuracy of the estimate: Updates cause additional data
> to be written to the table, so tables get bigger until vacuumed. Tables
> with many Inserts are also regularly trimmed with Deletes. With a
> relatively static workload and a regular vacuum cycle, the table size
> for many major tables eventually levels off, remaining roughly constant
> but the number of non-zero pages will vary over time in a saw-tooth
> curve. Estimating the cardinality by using the number of blocks would
> ignore the fact that many of them are empty for much of the time. That
> would then lead to a systematic over-estimate of the cardinality of the
> regularly updated tables.

You mean underestimate. After a VACUUM, the tuples-per-page figure would be set to a relatively low value, and then subsequent inserts would fill in the free space, causing the actual density to rise while the physical number of blocks stays more or less constant. So the proposed method would always give an accurate number of blocks, but it would tend to underestimate the number of tuples in a dynamic situation.

Still, it's better than the current method, which is likely to underestimate both parameters. I believe that having an accurate block count and an underestimated tuple count would tend to favor choosing indexscans over seqscans, which is probably a good thing --- when was the last time you saw someone complaining that the planner had improperly chosen an indexscan over a seqscan?

> How easy would it be to take into account the length of the FSM for the
> relation also?

Don't think this would help; the FSM doesn't really track number of tuples. Free space isn't a good guide to number of tuples because you can't distinguish inserts from updates at that level. (I'm also a bit concerned about turning the FSM into a source of contention.)

			regards, tom lane
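To put invented numbers on the direction of that bias: suppose VACUUM recorded 100 pages and 5000 live tuples, and inserts have since refilled the freed space to 8000 tuples without growing the file. The extrapolation still reports 5000 — the block count is exact, the tuple count low:

```c
/*
 * Extrapolated tuple count: current block count times the
 * tuples-per-page density measured at the last VACUUM.
 */
static double
extrapolated_tuples(double current_pages,
					double vac_reltuples, double vac_relpages)
{
	return current_pages * (vac_reltuples / vac_relpages);
}
```

With the numbers above the estimate comes out at 5000 against an actual 8000 — an underestimate, never an overestimate, so long as the file does not shrink.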
On Sun, 2004-11-28 at 18:52, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On the topic of accuracy of the estimate: Updates cause additional data
> > to be written to the table, so tables get bigger until vacuumed. Tables
> > with many Inserts are also regularly trimmed with Deletes. With a
> > relatively static workload and a regular vacuum cycle, the table size
> > for many major tables eventually levels off, remaining roughly constant
> > but the number of non-zero pages will vary over time in a saw-tooth
> > curve. Estimating the cardinality by using the number of blocks would
> > ignore the fact that many of them are empty for much of the time. That
> > would then lead to a systematic over-estimate of the cardinality of the
> > regularly updated tables.
>
> You mean underestimate. After a VACUUM, the tuples-per-page figure
> would be set to a relatively low value, and then subsequent inserts
> would fill in the free space, causing the actual density to rise
> while the physical number of blocks stays more or less constant.
> So the proposed method would always give an accurate number of blocks,
> but it would tend to underestimate the number of tuples in a dynamic
> situation.

OK, just *wrong* then (the estimate, as well as myself) :-)

> Still, it's better than the current method, which is likely to
> underestimate both parameters. I believe that having an accurate block
> count and an underestimated tuple count would tend to favor choosing
> indexscans over seqscans, which is probably a good thing --- when was
> the last time you saw someone complaining that the planner had
> improperly chosen an indexscan over a seqscan?

Well, yes, but user perception is not always right. Given we expect an underestimate, can we put in a correction factor should the estimate get really low? It sounds like we could end up choosing nested joins more often when we should have chosen merge joins.
That is something that people do regularly complain about (indirectly).

> > How easy would it be to take into account the length of the FSM for the
> > relation also?
>
> Don't think this would help; the FSM doesn't really track number of
> tuples. Free space isn't a good guide to number of tuples because you
> can't distinguish inserts from updates at that level. (I'm also a bit
> concerned about turning the FSM into a source of contention.)

Agreed.

-- 
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Given we expect an underestimate, can we put in a correction factor
> should the estimate get really low...sounds like we could end up
> choosing nested joins more often when we should have chosen merge joins.

One possibility: vacuum already knows how many tuples it removed. We could set reltuples equal to, say, the mean of the number-of-tuples-after-vacuuming and the number-of-tuples-before. In a steady state situation this would represent a fairly reasonable choice. In cases where the table size has actually decreased permanently, it'd take a few cycles of vacuuming before reltuples converges to the new value, but that doesn't seem too bad.

A standalone ANALYZE should still do what it does now, though, I think; namely set reltuples to its best estimate of the current value.

			regards, tom lane
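A sketch of that averaging rule (the helper name is invented; in the proposal VACUUM would apply this internally when it updates pg_class):

```c
/*
 * Invented helper illustrating the proposed rule: after each VACUUM,
 * store the mean of the tuple count before and after vacuuming.
 */
static double
next_reltuples(double tuples_before_vacuum, double tuples_after_vacuum)
{
	return (tuples_before_vacuum + tuples_after_vacuum) / 2.0;
}
```

For example, if a table shrinks permanently from 10000 to 1000 tuples, the first vacuum stores (10000 + 1000) / 2 = 5500; successive vacuums then pull the stored value toward the new steady state, and in a steady state (before == after) it is simply the true count.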
On Sun, 2004-11-28 at 22:35, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Given we expect an underestimate, can we put in a correction factor
> > should the estimate get really low...sounds like we could end up
> > choosing nested joins more often when we should have chosen merge joins.
>
> One possibility: vacuum already knows how many tuples it removed. We
> could set reltuples equal to, say, the mean of the number-of-tuples-
> after-vacuuming and the number-of-tuples-before. In a steady state
> situation this would represent a fairly reasonable choice. In cases
> where the table size has actually decreased permanently, it'd take a few
> cycles of vacuuming before reltuples converges to the new value, but that
> doesn't seem too bad.

That sounds good to me. Covers all cases I can see from here.

> A standalone ANALYZE should still do what it does now, though, I think;
> namely set reltuples to its best estimate of the current value.

A GUC-free solution...but yet manual control is possible. Sounds good to me - and for you Andreas, also?

-- 
Best Regards, Simon Riggs