RFC: planner statistics in 7.2

From: Tom Lane
Subject: RFC: planner statistics in 7.2
Msg-id: 23052.987719865@sss.pgh.pa.us
List: pgsql-hackers

Request for comments:

Overview
--------

The planner is currently badly handicapped by inadequate statistical
information.  The stats that VACUUM ANALYZE currently computes are:

per-table:
    number of disk pages
    number of tuples

per-column:
    dispersion
    minimum and maximum values (if datatype has a suitable '<' operator)
    most common value
    fraction of entries that are the most common value
    fraction of entries that are NULL

Not only is this an impoverished dataset, but the dispersion and
most-common-value are calculated in a very unreliable way, and cannot
be trusted.

Even though the stats are meager and untrustworthy, they are expensive
to get: the per-column data is updated only by VACUUM ANALYZE, which
is quite unpleasant to do regularly on large tables.

It is possible to get better statistics with less work.  Here is a
proposal for revising the stats mechanisms for 7.2.

I believe that it's still a good idea for the stats to be updated by an
explicit housekeeping command.  Although we could think of updating them
on-the-fly, that seems expensive and slow.  Another objection is that
planning results will become difficult to reproduce if the underlying
statistics change constantly.  But if we stick to the explicit-command
approach then we need to make the command much faster than VACUUM ANALYZE.
We can address that in two ways:

(1) The statistics-gathering process should be available as a standalone
command, ANALYZE [ tablename ], not only as part of VACUUM.  (This was
already discussed and agreed to for 7.1, but it never got done.)  Note
that a pure ANALYZE command needs only a read lock on the target table,
not an exclusive lock as VACUUM needs, so it's much more friendly to
concurrent transactions.

(2) Statistics should be computed on the basis of a random sample of the
target table, rather than a complete scan.  According to the literature
I've looked at, sampling a few thousand tuples is sufficient to give good
statistics even for extremely large tables; so it should be possible to
run ANALYZE in a short amount of time regardless of the table size.

If we do both of these then I think that ANALYZE will be painless enough
to do frequently, so there's no need to try to fold stats-updating into
regular operations.


More extensive statistics
-------------------------

The min, max, and most common value are not enough data, particularly
not for tables with heavily skewed data distributions.  Instead,
pg_statistic should store three small arrays for each column:

1. The K most common values and their frequencies, for some (adjustable)
parameter K.  This will be omitted if the column appears unique (no
duplicate values found).

2. The M boundary values of an equi-depth histogram, ie, the values that
divide the data distribution into equal-population sets.  For example, if
M is 3 this would consist of the min, median, and max values.

K and M might both be about 10 for a typical column.  In principle at
least, these numbers could be user-settable parameters, to allow trading
off estimation accuracy against amount of space used in pg_statistic.
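
To make the shape of this concrete, a per-column stats entry might look
roughly like the sketch below (field names are invented for
illustration; actual catalog storage would of course use ordinary
variable-length array columns in pg_statistic):

    /*
     * Illustrative only --- one per-column statistics entry.
     */
    typedef struct ColumnStats
    {
        float4      nullfrac;       /* fraction of entries that are NULL */
        float4      ndistinct;      /* est. # of distinct values (see below) */
        int         nmcv;           /* number of common values stored (<= K) */
        Datum      *mcv_values;     /* the K most common values */
        float4     *mcv_freqs;      /* their frequencies, as fractions */
        int         nhist;          /* number of histogram bounds (<= M) */
        Datum      *hist_bounds;    /* equi-depth histogram of the rest */
    } ColumnStats;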

A further refinement is that the histogram should describe the
distribution of the data *after excluding the given most-common values*
(that is, it's a "compressed histogram").  This allows more accurate
representation of the data distribution when there are just a few
very-common values.  For a column with not many distinct values (consider
a boolean column), the most-common-value list might describe the complete
dataset, in which case the histogram collapses to nothing.  (We'd still
have it store the min and max, just so that it's not necessary to scan the
most-common-value array to determine those values.)

I am also inclined to remove the attdispersion parameter in favor of
storing an estimate of the number of distinct values in the column.
We can compute more accurate selectivities using the most-common-value
frequencies and the distinct-values estimate than we could get from
dispersion.
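
To illustrate how those two numbers combine (just a sketch using the
ColumnStats layout above, not proposed code): for "col = const" where
const is not one of the stored common values, we can assume the rows
not covered by the common-value list are spread evenly across the
remaining distinct values:

    /*
     * Sketch: selectivity of "col = const" when const is NOT one of
     * the stored most-common values.
     */
    static double
    eq_selectivity(const ColumnStats *stats)
    {
        double      remaining = 1.0 - stats->nullfrac;
        int         i;

        /* subtract the population covered by the common-value list */
        for (i = 0; i < stats->nmcv; i++)
            remaining -= stats->mcv_freqs[i];

        /* spread what's left evenly over the other distinct values */
        if (stats->ndistinct > stats->nmcv)
            return remaining / (stats->ndistinct - stats->nmcv);
        return 0.0;                 /* everything is in the MCV list */
    }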

Another useful idea is to try to estimate the correlation between physical
table order and logical order of any given column --- this would let us
account for the effect of clustering when estimating indexscan costs.
I don't yet know the appropriate formula to use, but this seems quite
doable.
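
Just to indicate the flavor of the computation (this is NOT a proposal
for the formula): one candidate is a plain correlation coefficient
between each sampled value's physical position and its rank in sorted
order:

    #include <math.h>

    /*
     * Purely illustrative.  pos[i] is the order in which the i-th
     * sampled value was read from the table, rank[i] its position
     * after sorting.  A result near 1.0 means the table is
     * effectively clustered on this column.
     */
    static double
    order_correlation(const double *pos, const double *rank, int n)
    {
        double      sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        int         i;

        for (i = 0; i < n; i++)
        {
            sx += pos[i];
            sy += rank[i];
            sxx += pos[i] * pos[i];
            syy += rank[i] * rank[i];
            sxy += pos[i] * rank[i];
        }
        /* standard Pearson formula over the sample */
        return (n * sxy - sx * sy) /
            sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    }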

Finally, it would be a good idea to add an estimate of average width
of varlena fields to pg_statistic.  This would allow the estimated
tuple width computed by the planner to have something to do with reality,
which in turn would improve the cost estimation for hash joins (which
need to estimate the size of the in-memory hash table).
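
For instance (the overhead constant is invented purely for
illustration), the hash table size estimate could become something
like:

    /*
     * Sketch: sum_avg_widths is the sum of the per-column width
     * estimates from pg_statistic for the hashed relation.
     */
    #define PER_TUPLE_OVERHEAD 28       /* tuple header etc., invented */

    static double
    est_hashtable_bytes(double ntuples, double sum_avg_widths)
    {
        return ntuples * (PER_TUPLE_OVERHEAD + sum_avg_widths);
    }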


Computing the statistics
------------------------

The best way to obtain these stats seems to be:

1. Scan the target table and obtain a random sample of R tuples.  R is
chosen in advance based on target histogram size (M) --- in practice R
will be about 3000 or so.  If the table contains fewer than that many
tuples then the "sample" will be the whole table.  The sample can be
obtained in one pass using Vitter's algorithm or similar.  Note that for
expected values of R we shouldn't have any trouble storing all the sampled
tuples in memory.
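
For the record, the simplest one-pass scheme (Algorithm R; Vitter's
Algorithm Z produces the same uniform sample while skipping most of the
random-number draws) looks like this sketch, where next_tuple() and
random_below() stand in for the real scan and RNG calls:

    /*
     * One-pass reservoir sampling.  random_below(n) is assumed to
     * return a uniform random integer in [0, n).  After the scan,
     * reservoir[] holds a uniform random sample of (up to) R tuples.
     */
    static void
    sample_tuples(HeapTuple *reservoir, int R)
    {
        HeapTuple   tup;
        long        seen = 0;

        while ((tup = next_tuple()) != NULL)
        {
            if (seen < R)
                reservoir[seen] = tup;      /* still filling the reservoir */
            else
            {
                long        j = random_below(seen + 1);

                if (j < R)
                    reservoir[j] = tup;     /* keep with prob. R/(seen+1) */
            }
            seen++;
        }
    }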

2. For each column in the table, extract the column value from each
sampled tuple, and sort these values into order.  A simple scan of the
values then suffices to find the most common values --- we only need to
count adjacent duplicates and remember the K highest counts.  After we
have those, simple arithmetic will let us find the positions that contain
the histogram boundary elements.  Again, all this can be done in-memory
and so should be pretty fast.
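
In rough C, step 2 amounts to one counting pass over the sorted sample
(datums_equal, remember_if_top_k, and emit_histogram_bound are
invented placeholders):

    /*
     * Sketch of step 2 for one column.  values[] holds the R sampled
     * column values, already sorted by the datatype's '<' operator.
     * (For the compressed histogram described above, the chosen
     * common values would be deleted from values[] before the
     * boundary positions are computed.)
     */
    static void
    compute_column_stats(Datum *values, int R, int K, int M)
    {
        int         i, run_start = 0;

        /* count adjacent duplicates; keep the K longest runs */
        for (i = 1; i <= R; i++)
        {
            if (i == R || !datums_equal(values[i], values[run_start]))
            {
                remember_if_top_k(values[run_start], i - run_start, K);
                run_start = i;
            }
        }

        /* boundary b of M sits at fraction b/(M-1) of the sample;
         * for M = 3 this yields the min, median, and max */
        for (i = 0; i < M; i++)
            emit_histogram_bound(values[i * (R - 1) / (M - 1)]);
    }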

The sort step will require detoasting any toasted values.  To avoid risk
of memory overflow, we may exclude extremely wide toasted values from the
sort --- this shouldn't affect the stats much, since such values are
unlikely to represent duplicates.  Such an exclusion will also prevent
wide values from taking up lots of space in pg_statistic, if they happened
to be chosen as histogram entries.

Issue: for datatypes that have no '<' operator, we of course cannot
compute a histogram --- but if there is an '=' then the most-common-value
and number-of-distinct-values stats still make sense.  Is there a way to
derive these without O(R^2) brute-force comparisons?  We could still do
a scan of the R sample values using something like the existing
comparison algorithm (but using S > K work slots); this would cost about
S*R rather than R^2 comparisons.
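
Such a scan might look like the sketch below (names invented; the
eviction rule shown is the "space-saving" variant, other variants are
possible):

    /*
     * Each sampled value either bumps the count of a matching slot or
     * takes over the least-counted slot; with S comparisons per value
     * this is about S*R comparisons total.
     */
    typedef struct
    {
        Datum       value;
        int         count;
    } Slot;

    static void
    track_value(Slot *slots, int S, Datum v)
    {
        int         i, min_i = 0;

        for (i = 0; i < S; i++)
        {
            if (slots[i].count > 0 && datums_equal(slots[i].value, v))
            {
                slots[i].count++;
                return;
            }
            if (slots[i].count < slots[min_i].count)
                min_i = i;
        }
        /* no match: evict the least-counted slot */
        slots[min_i].value = v;
        slots[min_i].count++;
    }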

A different approach that's been discussed on pghackers is to make use
of btree indexes for columns that have such indexes: we could scan the
indexes to visit all the column values in sorted order.  I have rejected
that approach because (a) it doesn't help for columns without a suitable
index; (b) our indexes don't distinguish deleted and live tuples,
which would skew the statistics --- in particular, we couldn't tell a
frequently-updated single tuple from a commonly repeated value; (c)
scanning multiple indexes would likely require more total I/O than just
grabbing sample tuples from the main table --- especially if we have to
do that anyway to handle columns without indexes.


User-visible changes
--------------------

Aside from implementing the stand-alone ANALYZE statement, it seems that
it would be a good idea to allow user control of the target numbers of
statistical entries (K and M above).  A simple approach would be a SET
variable or explicit parameter for ANALYZE.  But I am inclined to think
that it'd be better to create a persistent per-column state for this,
set by, say,

        ALTER TABLE tab SET COLUMN col STATS COUNT n

(better names welcome).  The target count for each column could be stored
in pg_attribute.  Doing it this way would let the dbadmin establish higher
or lower targets for especially irregular or simple columns, and then
forget about the numbers --- he wouldn't have to tweak his periodic cron
script to apply the right parameters to the right tables/columns.
        regards, tom lane

