From: Simon Riggs
Subject: Improving N-Distinct estimation by ANALYZE
Msg-id: 1136401829.21025.151.camel@localhost.localdomain
List: pgsql-hackers

Improving N-Distinct estimation
===============================
v1.1

OBJECTIVES

Answer these points...

- Are there any performance issues that can be directly attributed to
mis-estimation of N-Distinct ("D") by the ANALYZE command? 

- If so, can we do better than we currently achieve? How?

- Would altering the estimate of D cause problems in other places?

Comments are sought on the problem and the possible solutions. Please
get involved if you can help with detailed analyses on this topic.

SUMMARY

The estimation of D is difficult and imprecise. The current method works
well in many cases, yet breaks down *badly* in one particular very
common use case of database design: a large dependent table with a
multi-column Primary Key. In some cases the estimate of D *decreases* as
the size of the table increases, though the estimate of D is an
underestimate in almost all cases, whatever the table design.

PostgreSQL cvstip currently seriously underestimates D for very large
dependent tables with rows that are clustered around one of the columns
of a multi-column Primary Key. This mis-estimation can lead to poor
system performance when it results in the choice of HashAggregate or,
more generally, in the mis-selection of join plans.

An example of this is the orders and lineitem tables from DBT-3/TPC-H,
which have a 1:M relationship. There are M lineitems associated with
each order and all are inserted in one transaction into lineitem. If M
is relatively large, then problems may ensue.
A problem SQL statement may be something like:

    SELECT l_orderkey, sum(l_extendedprice)
    from lineitem
    group by l_orderkey;

This issue could have an impact on any large table in a 1:M relationship
where the actual number of distinct values, D, is much larger than the
sample size, n (D >> n). This can also affect associative tables where
more than one 1:M relationship exists to separate tables, such as Fact
tables in a star schema.

The issue is alleviated by setting the column statistics target higher,
though this merely raises the table size at which the problems begin to
occur.

There are a number of ways we can improve on the current estimates,
using techniques suggested in later statistical research.

Some proposals are made and comments are sought on the problem and the
possible solutions.

It is possible that we need more than one estimate of D for various
purposes: a low estimate of D for use in join planning, and a higher
estimate to reduce the risk of under-sized hash table operations. This
approach might be taken initially to allow us to implement improved
estimators without throwing out many previously good plans.

WHAT WE CURRENTLY DO WITH ANALYZE

Notation
D = estimate of the number of distinct values (aka n_distinct)
N = number of rows in table
n = number of rows in sample

Sampling method
* Fixed sample size, no matter how big the table, following Chaudhuri et
al's 1998 paper on sample size sufficiency for statistics histograms.

* Sample blocks = sample rows = 300 * col stats target 
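(With the default statistics target of 10, that means 3,000 sample rows
drawn from up to 3,000 blocks, however large the table is.)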

Results
* Count rows per value for all values observed in the sample
f1 = number of values that occur exactly once in the sample
d = number of distinct values in the sample

* If f1 == n => assume unique: D = N and scale with N
else If f1 == 0 => assume D = d
else => apply Haas-Stokes [1998] estimator

* If D > 10% of N => scale with N
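
For reference, the estimator applied in the final branch is (as I read
analyze.c) the Haas-Stokes smoothed jackknife ("Duj1") form, which in
the notation above is

    D = n*d / ( n - f1 + f1*n/N )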

[There are a variety of techniques selected from Haas-Stokes [1998],
Chaudhuri et al [1998], Vitter and Knuth. Sometimes these authors have
discussed the same subject and come up with different answers, so you
need to be careful to say which reference you mean when discussing
these.]

ISSUES

1. Estimation of D; mentioned above and covered in more detail below.
(see ESTIMATES OF D FOR DEPENDENT TABLES)

2. The sample size calculation correctly follows Chaudhuri et al [1998]
when the number of rows in the table is around 1 million. However, the
sample is larger than needed for smaller tables and too small for larger
ones. The sample size should be multiplied by 2.3 (i.e. ln(10)) for
every 10x increase in table size, e.g. a 100 million row table requires
a sample 4.6 times larger to have the same accuracy for histogram
selection.


OBSERVATIONS

1. *All* methods of statistical analysis are improved by larger sample
fractions. The D estimation method currently in use shows an optimum
trade-off between accuracy and sample fraction at around 5% of a table,
as shown in the authors' original paper [Haas & Stokes (1998)]. The
current implementation's error rates climb higher as table size
increases.

2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
row sampling technique rather than a block sampling technique. This
inefficiency would translate directly into a performance drop at larger
sample fractions, but since we currently use a fixed sample size the
problem is not yet visible for larger tables. With a 2GB table, we would
typically sample around 1% of the blocks, yet only around 0.025 - 0.05%
of the rows.
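
To put rough numbers on that (assuming the default 8 KB BLCKSZ and, say,
25-50 rows per block): a 2GB table is about 262,000 blocks; with the
default statistics target of 10 we sample 3,000 rows and so touch up to
3,000 blocks (~1.1% of the blocks), while 3,000 rows out of the roughly
6-13 million rows in such a table is only around 0.02-0.05% of the rows.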

3. Large values of statistics target (up to 1000) could cause a number
of problems with statistics catalog table growth (mentioned on
-perform). Setting these values appropriately can take significant
effort. Automatic scaling of such parameters is desirable.

4. ANALYZE doesn't use more memory if maintenance_work_mem is set high,
nor does it use less if it is set low. Actual memory usage isn't
measured at all. With very long rows this is max 24 MB with default
stats target settings and BLCKSZ, or 2.4 GB with highest stats target
(1000). This probably is of lower importance since stats targets are
only usually set higher on larger databases, which typically have larger
memory configurations anyway - and very long rows are uncommon because
of TOAST. Typical memory usage by ANALYZE would be < 1 MB with default
settings i.e. maybe as low as a 0.01% sample for a very large table.

ESTIMATES OF D FOR DEPENDENT TABLES

Let's look at this example from DBT-3/TPC-H:

Two tables, orders and lineitem. Each row in orders has M rows in
lineitem, so they have a 1:M relationship.

create table orders (o_orderkey integer, o_custkey integer,
o_orderstatus char(1), o_totalprice numeric(12,2), o_orderdate date,
o_orderpriority char(15), o_clerk char(15), o_shippriority integer,
o_comment varchar(79), primary key ( o_orderkey ) );

create table lineitem (l_orderkey integer, l_partkey integer,
l_suppkey integer, l_linenumber integer, l_quantity numeric(12,2),
l_extendedprice numeric(12,2), l_discount numeric(12,2), l_tax
numeric(12,2), l_returnflag char(1), l_linestatus char(1), l_shipdate
date, l_commitdate date, l_receiptdate date, l_shipinstruct char(25),
l_shipmode char(10), l_comment varchar(44), primary key ( l_orderkey,
l_linenumber ) );

where lineitem.l_orderkey references o_orderkey

The rows in lineitem are all inserted in the same transaction as the
order itself. As a result, they are very likely to be in the same data
block, or in adjacent data blocks (*)

ANALYZE randomly samples rows, so that the average gap between randomly
selected rows increases as the table size increases, because of the
fixed sample size. Since the clustered rows are typically close
together, then the apparent number of multiple instances of the same
data value decreases as the sample fraction decreases. Since the sample
size is currently fixed, this means that the D estimate decreases as the
table size increases. (This is proven in a test case below).

(*) The only alleviation from this effect occurs when we have the FSM
full of randomly placed blocks. In that case, we will sometimes get
consecutively INSERTed lineitem rows in blocks that are wide apart
within the table. However, in our example the fulfillment of orders is
not random and so blocks with freespace tend to appear first at the
beginning of the table and cycle through the table over time. So, even
in the unlikely event that we have rows with the same l_orderkey value
widely separated in the table, we are still unlikely to actually observe
that when sample fraction is low. Data Warehouse applications seldom
delete rows, hence the FSM is unused, so this slight cause for hope is
unlikely to exist.

There is a further effect of concern here. We currently apply a "lift"
to the D estimate when we think that the number of distinct values is
high enough to imply that it should be scaled according to N. In the
above example, when sample ratio is too small, D will hit the point
where it is too low to be scaled and we suddenly "bomb out" to a much
lower value.

Test results on a table constructed as follows:
Number of orderkey vals Rows per orderkey
200,000             2
200,000             4
200,000             6
200,000             8
200,000             10
Total orderkey values = 1,000,000 (D-exact)
Total rows = 6,000,000
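
For anyone who wants to reproduce this, a table of the same shape can be
built with something like the sketch below (names are illustrative - the
exact statements aren't shown here - and the ORDER BY keeps rows for the
same orderkey physically clustered, as the DBT-3 load does):

create table test_lineitem (l_orderkey integer, l_linenumber integer,
    primary key (l_orderkey, l_linenumber));

-- sketch only: five groups of 200,000 orderkeys with 2,4,6,8,10 rows each
insert into test_lineitem
select 200000 * (g - 1) + o, l
from generate_series(1, 5) g,
     generate_series(1, 200000) o,
     generate_series(1, 10) l
where l <= 2 * g
order by 1, 2;

alter table test_lineitem alter column l_orderkey set statistics 10;
analyze test_lineitem;
select n_distinct from pg_stats
 where tablename = 'test_lineitem' and attname = 'l_orderkey';

The last three statements correspond to the first line of the results
below; raising the statistics target and re-running ANALYZE gives the
other lines.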

orderkey stats target D estimates
10 (10% blocks, 3k rows read)   106k, 113k, 114k
20 (20% blocks, 6k rows read)   201k, 185k, 211k
30 (30% blocks, 9k rows read)   301k, 303k, 324k
40 (40% blocks, 12k rows read)  431k, 378k
60 (60% blocks, 18k rows read)  530k
80 (80% blocks, 24k rows read)  646k(-0.107), 732k(-0.122)
100 (all blocks, 30k rows read) 823k(-0.137), 782k(-0.13), 785k(-0.13)
200 (all blocks, 60k rows read) 794k(-0.132), 810k(-0.135)

The numbers in brackets are the (negative) n_distinct values recorded,
denoting that we have inferred that D should scale with N; n_distinct is
stored as the negated fraction of rows that are distinct, so the
estimate of D is then -n_distinct * N.

The test uses increasing stats targets to increase the sample size. I
assume that increasing the table size while maintaining the data
distribution, yet keeping the sample size constant, would have the same
effect as the results shown here, since the sample fractions would be
similar.

The results show that the estimate of D rises roughly in proportion to
the sample fraction (equivalently, for a fixed sample size it falls as
the table grows), but only over the middle range of sample sizes. At
each end of the
sample size scale, we handle matters correctly. If we sample enough, we
recognise the actual distribution and decide to scale the result. If we
sample a very small fraction and this reveals an apparently unique
distribution, we decide that the distribution itself is unique and we
suddenly decide D = N.

The test results don't seem too bad if you view the estimate of D as at
most a factor of 10 wrong. However, since the error scales up with the
size of the table, we can imagine very large estimation errors.

The results show that even when we scan 60% of the example table we
consistently get an answer around half of the actual value of D. This is
misleading because we are using row sampling: we have visited around
18,000 blocks out of 29,412, yet examined only 18,000 rows out of 6
million. So we've paid the high I/O cost to scan most of the table, but
not got the benefit from that. That could lead to the conclusion that
increasing sample size has little or no benefit, but that is only true
if we use row rather than block sampling techniques.

We do already calculate the correlation for each attribute, so this
could be used to influence the results. However, the correlation value
itself may be low in a steady state table such as lineitem, since the
filling of data blocks is likely to be cyclic.

REVIEW OF RESEARCH

Andrew Dunstan points out Brutlag & Richardson [2000] "A Block Sampling
Approach to Distinct Value Estimation". In this paper a number of
different techniques are reviewed. [Some previous block-based estimators
have been proposed on-list, though these had less statistical basis and
the range of application was based on a priori knowledge of D]

From it we note that block sampling is an effective technique when the
attributes are genuinely randomly distributed within the table. Block
sampling is also effective at identifying clustered (as opposed to
merely correlated) attributes with relatively low sample fractions.

Chaudhuri's estimator is based on a least risk approach, rather than a
greatest accuracy approach, which does sound appealing should we not be
able to apply an improved estimator.

Haas & Stokes [1998] note that "Initial experiments indicated that the
reduction in estimation error due to using the high-order jack-knife is
outweighed by the increase in error due to uncertainty in the moment
estimates." ...Experiments both here and by Brutlag & Richardson show
that not doing so leads to poor results with clustered and correlated
data.

Brutlag & Richardson conclude that "a hybrid estimator" may yield a
better approach than the ones they have come up with.

Chaudhuri et al [1998] mention an "adaptive page sampling" approach.

PROPOSALS

The current sampling approach visits many blocks yet retrieves few rows.

We know that:
i) block sampling is an effective means of detecting clustered data
ii) block sampling could lead to bias in the conclusions reached
iii) the current ANALYZE approach has problems with clustered data

I propose

1. decide that D should scale with N if the |correlation| > an arbitrary
limit [0.99 suggested to ensure only highly certain correlations are
taken] (and only if we have not already decided that D > d). This alone
would "fix" the estimate of D for all of the cases highlighted in my
test case, above. However, my test was both correlated and clustered,
and many real cases would be clustered only (say, if many sessions were
reading values from a sequence before inserting, the correlation would
not be so near perfect).
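
As a quick, purely illustrative way of seeing which columns such a rule
would touch on an existing database (both statistics are already exposed
in pg_stats):

select tablename, attname, n_distinct, correlation
from pg_stats
where abs(correlation) > 0.99
order by tablename, attname;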

2. When
i) we have a large table (> 1,000,000 rows if known, or > 50,000 * max
stats target blocks), and
ii) we have ample additional memory to collect a larger sample size,

then we make these relatively minor changes to the current approach:

a). we collect a sample that consists of both block and row sampled
data. Any estimator applied to this data will then naturally be a hybrid
estimator. This would be done in such a way that no new I/O would be
incurred, i.e. we would block sample *some* of the blocks which we will
read for row sampling purposes. Doing that will increase our knowledge
of clustered data, as well as lifting the sample size as is required to
maintain the accuracy of histogram calculation.

This means we will still use a fixed size sample, but the sample size
will be larger according to maintenance_work_mem. The sample will still
fit in memory, so the required sorts will still be fast memory sorts.
Also, the sample size grows without any need to increase the statistics
targets for particular columns; this happens automatically when
maintenance_work_mem is set higher.

e.g. If we estimate memory usage as 1000 * n, we can estimate how much
memory is available for additional block-related sampling. If we block
sample 10% of all blocks selected for row samples, then this will use at
most BLCKSZ * n /10 additional memory. 
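
For instance, with the maximum stats target in use at 100, n = 30,000:
the row sample is then estimated at about 30 MB, and block sampling 10%
of the selected blocks adds at most 30,000/10 * BLCKSZ = ~24 MB with the
default 8 KB BLCKSZ.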

I'm not suggesting (yet) we follow the suggestion of Chaudhuri et al for
adaptive page/block sampling, since we do not actually compare values
retrieved until we sort them later. That would require more involved
changes to the current ANALYZE mechanisms. We can look at the row
lengths to ensure that the block sampled rows do not completely swamp
the row sampled ones.

b). look at individual block clustering using the tupnoLink and the
tuple TIDs. If the tuple TID == tupnoLink[tuple]->TID then they are from
the same block. This will allow us to spot block clustered data even
when the data is not highly correlated because of the effects of
deletion and insertion on steady state tables. If clustering appears to
exist, apply the best block estimator from Brutlag & Richardson [2000];
if not, stick with Haas/Stokes.

3. We should also apply multi-column heuristics to the estimation of D,
once we have estimated all columns. For column groups (pairs, triples
etc) that form part of a PK, we know that it must be true that D1 *
D2 ... Dk >= N. In many cases we will be very confident of our estimate
of D when we decide D = d; i.e. when we have two columns, we can use this
to infer that D1 = N/d when D2 = d. So we can do this in any case where
we have confident estimates of all but one column; the required
information is available at that time.
e.g. if lineitem has primary key ( l_orderkey, l_linenumber ) and we know
that there are at most 10 l_linenumber values in the table, then there
should be N/10 values for l_orderkey, so set it to that if it is lower
(only).
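
To illustrate the two-column case with a concrete (if naive) cross-check
expressed against pg_stats - ignoring, for clarity, the negative scaled
encoding of n_distinct, and noting that inside ANALYZE this would be
done on the raw estimates instead:

-- For the PK (l_orderkey, l_linenumber): if D1 * D2 < N then the
-- l_orderkey estimate is too low, and can be raised to at least N / D2.
select c.reltuples as n,
       s1.n_distinct as d_orderkey,
       s2.n_distinct as d_linenumber,
       greatest(s1.n_distinct, c.reltuples / s2.n_distinct) as proposed_d_orderkey
from pg_class c
join pg_stats s1 on s1.tablename = c.relname and s1.attname = 'l_orderkey'
join pg_stats s2 on s2.tablename = c.relname and s2.attname = 'l_linenumber'
where c.relname = 'lineitem';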

4. There are a number of other heuristics we could use but those become
more specialised and complex as we proceed down that path.

5. We should implement hash-table overflow logic for HashAggregates just
as has been done for HashJoins. Such overflow logic merely copes with
the problem, which could mean that some queries run for very extended
durations. We should emit a log message advising when the overflow logic
subdivides the hash table more than twice (i.e. we have underestimated D
by more than 4x), so that admins can take action and/or -hackers gets to
find out when badness occurs somewhere.

6. A set of functions to override n_distinct when we wish to do this,
allowing database designers to set some of these values by definition,
rather than relying on ANALYZE to discover them. This would require
another column on pg_statistic so that the override value is never
overridden itself by subsequent ANALYZEs.
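
For illustration only, the crudest possible form of such an override
would be a function that writes pg_statistic directly - which also shows
why the extra column is needed, since a later ANALYZE simply overwrites
the value:

-- Sketch only: not an existing function; requires superuser, and the
-- value set here is lost at the next ANALYZE unless protected as
-- proposed above.
create function set_n_distinct(tab regclass, col name, d float4)
returns void as $$
  update pg_statistic
     set stadistinct = $3
   where starelid = $1
     and staattnum = (select attnum from pg_attribute
                      where attrelid = $1 and attname = $2);
$$ language sql;

-- e.g. declare that lineitem.l_orderkey is ~10% distinct, scaling with N
select set_n_distinct('lineitem', 'l_orderkey', -0.1);
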
=================================================================




