Thread: Improving N-Distinct estimation by ANALYZE

Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
Improving N-Distinct estimation
===============================
v1.1

OBJECTIVES

Answer these points...

- Are there any performance issues that can be directly attributed to
mis-estimation of N-Distinct ("D") by the ANALYZE command? 

- If so, can we do better than we currently achieve? How?

- Would altering the estimate of D cause problems in other places?

Comments are sought on the problem and the possible solutions. Please
get involved if you can help with detailed analyses on this topic.

SUMMARY

The estimation of D is difficult and imprecise. The current method works
well in many cases, yet breaks down *badly* in one particular very
common use case of database design: a large dependent table with a
multi-column Primary Key. In some cases the estimate of D *decreases* as
the size of the table increases, though the estimate of D is an
underestimate in almost all cases, whatever the table design.

PostgreSQL cvstip currently seriously underestimates D for very large
dependent tables with rows that are clustered around one of the columns
of a multi-column Primary Key. This mis-estimation can lead to poor
system performance when it results in the use of HashAggregate or in the
mis-selection of join plans.

An example of this is the orders and lineitem tables from DBT-3/TPC-H,
which have a 1:M relationship. There are M lineitems associated with
each order and all are inserted in one transaction into lineitem. If M
is relatively large, then problems may ensue.
A problem SQL statement may be something like:

    SELECT l_orderkey, sum(l_extendedprice)
    FROM lineitem
    GROUP BY l_orderkey;

This issue could have an impact on any large table in a 1:M relationship
where the actual number of distinct values, D, is much larger than the
sample size, n (D >> n). This can also affect associative tables where
more than one 1:M relationship exists to separate tables, such as Fact
tables in a star schema.

The issue is alleviated by setting the column statistics target higher,
though this merely shifts upwards the range of table sizes over which
the problems occur.

There are a number of ways we can improve on the current estimates,
using techniques suggested in later statistical research.

Some proposals are made and comments are sought on the problem and the
possible solutions.

It is possible that we need more than one estimate of D for various
purposes: a low estimate of D for use in join planning, and a higher
estimate to reduce the risk of under-sized hash table operations. This
approach might be taken initially to allow us to implement improved
estimators without throwing out many previously good plans.

WHAT WE CURRENTLY DO WITH ANALYZE

Notation
D = estimate of the number of distinct values (aka n_distinct)
N = number of rows in table
n = number of rows in sample

Sampling method
* Fixed sample size, no matter how big table, following Chaudhuri et
al's 1998 paper on sample size sufficiency for statistics histograms.

* Sample blocks = sample rows = 300 * col stats target 

Results
* Count the rows per value for all values observed in the sample
f1 = number of values that occur exactly once in the sample
d = number of distinct values in the sample

* If f1 == n => assume unique: D = N and scale with N
else If f1 == 0 => assume D = d
else => apply Haas-Stokes [1998] estimator

* If D > 10% of N => scale with N

[There are a variety of techniques selected from Haas-Stokes [1998],
Chaudhuri et al [1998], Vitter and Knuth. Sometimes these authors have
discussed the same subject and come up with different answers, so you
need to be careful to say which reference you mean when discussing
these.]
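
As a concrete restatement of those rules, here is a rough sketch in
Python. It is an illustration of the description above, not the
analyze.c source; the exact Duj1 form of the Haas-Stokes estimator and
the convention that a negative result means "a fraction of N" are my
reading of the code and should be checked against it.

def estimate_distinct(sample_counts, N):
    """sample_counts maps each value seen in the sample to its count in
    the sample; N is the (estimated) number of rows in the table."""
    n = sum(sample_counts.values())                        # rows in sample
    d = len(sample_counts)                                 # distinct values in sample
    f1 = sum(1 for c in sample_counts.values() if c == 1)  # values seen exactly once

    if f1 == n:
        # every sampled value was seen exactly once: assume a unique column;
        # -1.0 means "one distinct value per row", i.e. D = N, scaling with N
        return -1.0
    if f1 == 0:
        # every sampled value was seen more than once: assume we saw them all
        D = float(d)
    else:
        # Haas-Stokes (Duj1): n*d / (n - f1 + f1*n/N)
        D = (n * d) / ((n - f1) + f1 * n / N)

    if D > 0.10 * N:
        return -(D / N)    # record as a fraction of N so the estimate scales
    return D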

ISSUES

1. Estimation of D; mentioned above and covered in more detail below.
(see ESTIMATES OF D FOR DEPENDENT TABLES)

2. The sample size calculation correctly follows Chaudhuri et al [1998]
when the number of rows in the table is around 1 million. However, the
fixed sample is more than sufficient for smaller tables and insufficient
for larger ones. The sample size should be multiplied by 2.3 (i.e.
ln(10)) for every x10 increase in table size; e.g. a 100 million row
table requires a sample 4.6 times larger to give the same accuracy of
histogram selection.
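
Taking that rule at face value (the 1 million row calibration point and
the ln(10) factor come from the paragraph above; the function itself is
only an illustration, not a proposal for an exact formula):

import math

def scaled_sample_rows(N_rows, stats_target=10):
    base = 300 * stats_target                    # the current fixed sample
    decades_over_1m = math.log10(N_rows / 1_000_000)
    return int(base * max(1.0, math.log(10) * decades_over_1m))

for N in (10**6, 10**7, 10**8, 10**9):
    print(f"{N:>13,} rows -> sample of {scaled_sample_rows(N):,}")
# 1,000,000 -> 3,000 ... 100,000,000 -> 13,815 (the "4.6 times larger" case)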


OBSERVATIONS

1. *All* methods of statistical analysis are improved by larger sample
fractions. The D estimator method currently in use shows its best
trade-off of accuracy against sample fraction at around 5% of a table,
as shown in the authors' original paper [Haas Stokes (1998)]. The
current implementation's error rate climbs as table size increases,
because the sample fraction shrinks.

2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
row sampling technique rather than a block sampling technique. This
would translate directly into a performance drop from large sample
ratios, but since we currently use a fixed sample size this problem is
not yet visible for larger tables. With a 2GB table, we would typically
sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.
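
The arithmetic behind those figures, for reference (8 kB BLCKSZ and a
statistics target of 10 are the defaults; the 170-340 byte average row
width is an assumption of mine):

BLCKSZ = 8192
table_bytes = 2 * 1024**3
blocks = table_bytes // BLCKSZ            # 262,144 blocks
sample_rows = 300 * 10                    # 3,000 rows, from up to 3,000 blocks

print(f"{sample_rows / blocks:.1%} of blocks")    # ~1.1%
for row_width in (170, 340):
    rows_in_table = table_bytes // row_width
    print(row_width, f"{sample_rows / rows_in_table:.3%} of rows")
# roughly 0.024% and 0.047% of rows, in line with the figures above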

3. Large values of statistics target (up to 1000) could cause a number
of problems with statistics catalog table growth (mentioned on
-perform). Setting these values appropriately can take significant
effort. Automatic scaling of such parameters is desirable.

4. ANALYZE doesn't use more memory if maintenance_work_mem is set high,
nor does it use less if it is set low. Actual memory usage isn't
measured at all. With very long rows this is max 24 MB with default
stats target settings and BLCKSZ, or 2.4 GB with highest stats target
(1000). This probably is of lower importance since stats targets are
only usually set higher on larger databases, which typically have larger
memory configurations anyway - and very long rows are uncommon because
of TOAST. Typical memory usage by ANALYZE would be < 1 MB with default
settings i.e. maybe as low as a 0.01% sample for a very large table.

ESTIMATES OF D FOR DEPENDENT TABLES

Let's look at this example from DBT-3/TPC-H:

Two tables, orders and lineitem. Each row in orders has M rows in
lineitem, so they have a 1:M relationship.

create table orders (
    o_orderkey integer, o_custkey integer, o_orderstatus char(1),
    o_totalprice numeric(12,2), o_orderdate date, o_orderpriority char(15),
    o_clerk char(15), o_shippriority integer, o_comment varchar(79),
    primary key ( o_orderkey )
);

create table lineitem (
    l_orderkey integer, l_partkey integer, l_suppkey integer,
    l_linenumber integer, l_quantity numeric(12,2),
    l_extendedprice numeric(12,2), l_discount numeric(12,2),
    l_tax numeric(12,2), l_returnflag char(1), l_linestatus char(1),
    l_shipdate date, l_commitdate date, l_receiptdate date,
    l_shipinstruct char(25), l_shipmode char(10), l_comment varchar(44),
    primary key ( l_orderkey, l_linenumber )
);

where lineitem.l_orderkey references o_orderkey

The rows in lineitem are all inserted in the same transaction as the
order itself. As a result, they are very likely to be in the same data
block, or in adjacent data blocks (*).

ANALYZE randomly samples rows, so the average gap between selected rows
increases as the table size increases, because of the fixed sample size.
Since the clustered rows are typically close together, the apparent
number of multiple instances of the same data value decreases as the
sample fraction decreases. Since the sample size is currently fixed,
this means that the D estimate decreases as the table size increases.
(This is demonstrated in a test case below.)

(*) The only alleviation from this effect occurs when we have the FSM
full of randomly placed blocks. In that case, we will sometimes get
consecutively INSERTed lineitem rows in blocks that are wide apart
within the table. However in our example, the fulfillment of orders is
not random and so blocks with freespace tend to appear first at the
beginning of the table and cycle through the table over time. So, even
in the unlikely event that we have rows with the same l_orderkey value
widely separated in the table, we are still unlikely to actually observe
that when sample fraction is low. Data Warehouse applications seldom
delete rows, hence the FSM is unused, so this slight cause for hope is
unlikely to exist.

There is a further effect of concern here. We currently apply a "lift"
to the D estimate when we think that the number of distinct values is
high enough to imply that it should be scaled according to N. In the
above example, when sample ratio is too small, D will hit the point
where it is too low to be scaled and we suddenly "bomb out" to a much
lower value.
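
To see the whole mechanism end to end, here is a small, self-contained
simulation (not ANALYZE itself, just the shape of the problem). It lays
out the test-case table described below with each orderkey's rows stored
consecutively, draws a two-stage sample in roughly the way ANALYZE does
(pick 300 * stats_target blocks, then keep a uniform row sample of that
size from those blocks), and applies the Duj1 sketch from earlier in
this note. The 200 rows-per-block figure is an assumption; the run needs
a few hundred MB of memory and a few seconds.

import random
from collections import Counter

ROWS_PER_BLOCK = 200       # assumed; the real figure depends on row width

def duj1(counts, N):
    n = sum(counts.values())
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    if f1 == n:
        return float(N)                          # sample looks unique
    if f1 == 0:
        return float(d)
    return (n * d) / ((n - f1) + f1 * n / N)     # Haas-Stokes Duj1

def clustered_table():
    rows, key = [], 0
    for run in (2, 4, 6, 8, 10):                 # the test-case mix below
        for _ in range(200_000):
            rows.extend([key] * run)
            key += 1
    return rows                                  # 6,000,000 rows, 1,000,000 keys

def two_stage_sample(rows, target):
    nblocks = len(rows) // ROWS_PER_BLOCK
    blocks = random.sample(range(nblocks), min(target, nblocks))
    pool = [rows[b * ROWS_PER_BLOCK + i] for b in blocks
            for i in range(ROWS_PER_BLOCK)]
    return Counter(random.sample(pool, min(target, len(pool))))

rows = clustered_table()
for stats_target in (10, 30, 60, 100):
    est = duj1(two_stage_sample(rows, 300 * stats_target), len(rows))
    print(stats_target, int(est))
# The estimates climb from roughly 100k towards the true 1,000,000 only as
# the sampled fraction grows, much the same shape as the measured table below.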

Test results on a table constructed as follows:

  Number of orderkey values    Rows per orderkey
  200,000                      2
  200,000                      4
  200,000                      6
  200,000                      8
  200,000                      10

  Total orderkey values = 1,000,000 (the exact D)
  Total rows            = 6,000,000

  orderkey stats target (sample taken)   D estimates
  10   (10% blocks, 3k rows read)        106k, 113k, 114k
  20   (20% blocks, 6k rows read)        201k, 185k, 211k
  30   (30% blocks, 9k rows read)        301k, 303k, 324k
  40   (40% blocks, 12k rows read)       431k, 378k
  60   (60% blocks, 18k rows read)       530k
  80   (80% blocks, 24k rows read)       646k (-0.107), 732k (-0.122)
  100  (all blocks, 30k rows read)       823k (-0.137), 782k (-0.13), 785k (-0.13)
  200  (all blocks, 60k rows read)       794k (-0.132), 810k (-0.135)

The figures in brackets are cases where ANALYZE inferred that D should
scale with N: the stored n_distinct is negative and expresses D as a
fraction of the row count, i.e. D is approximately |n_distinct| * N
(so -0.137 corresponds to roughly 822,000 distinct values here).
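
For concreteness, the conversion back to a distinct-count, assuming the
usual pg_statistic convention that a negative n_distinct is a fraction
of the row count:

N = 6_000_000
for stored in (-0.107, -0.137):
    print(stored, int(-stored * N))   # -0.137 -> 822000, close to the 823k shown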

The test uses an increasing stats target to increase the sample size. I
assume that increasing the table size while maintaining the data
distribution and keeping the sample size constant would have the same
effect as the results shown here, since the sample fractions would be
similar.

The results show that, over the middle range of sample sizes, the
estimate of D tracks the sample fraction; equivalently, for a fixed-size
sample the estimate falls as the table grows. At each end of the sample
size scale we handle matters correctly. If we sample enough, we
recognise the actual distribution and decide to scale the result. If we
sample a very small fraction and the sample looks entirely unique, we
decide that the distribution itself is unique and we suddenly decide
D = N.

The test results don't seem too bad if you view the estimate of D as at
most a factor of 10 wrong. However, since the error scales up with the
size of the table, we can imagine very large estimation errors.

The results show that even when we scan 60% of the example table we
consistently get an answer around half of the actual value of D. This is
misleading because we are using row sampling, so we have visited 18,000
blocks out of 29,412, yet examined only 18,000 rows out of 6 million. So
we've paid the high I/O cost to scan the table, but not got the benefit
from that. That could lead to the conclusion that increasing sample size
has little or no benefit, but that is only true if we use row rather
than block sampling techniques.

We already calculate the correlation for each attribute, so this could
be used to influence the results. However, the correlation value itself
may be poor in a steady state table such as lineitem, since the filling
of data blocks is likely to be cyclic.

REVIEW OF RESEARCH

Andrew Dunstan points out Brutlag & Richardson [2000] "A Block Sampling
Approach to Distinct Value Estimation". In this paper a number of
different techniques are reviewed. [Some previous block-based estimators
have been proposed on-list, though these had less statistical basis and
the range of application was based on a priori knowledge of D]

From it we note that block sampling is an effective technique when the
attributes are genuinely randomly distributed within the table. Block
sampling is also effective at identifying clustered (as opposed to
merely correlated) attributes with relatively low sample fractions.

Chaudhuri's estimator is based on a least risk approach, rather than a
greatest accuracy approach, which does sound appealing should we not be
able to apply an improved estimator.

Haas & Stokes [1998] note that "Initial experiments indicated that the
reduction in estimation error due to using the high-order jack-knife is
outweighed by the increase in error due to uncertainty in the moment
estimates." ...Experiments both here and by Brutlag & Richardson show
that not doing so leads to poor results with clustered and correlated
data.

Brutlag & Richardson conclude that "a hybrid estimator" may yield a
better approach than the ones they have come up with.

Chaudhuri et al [1998] mention an "adaptive page sampling" approach.

PROPOSALS

The current sampling approach visits many blocks yet retrieves few rows.

We know that:
i) block sampling is an effective means of detecting clustered data
ii) block sampling could lead to bias in the conclusions reached
iii) the current ANALYZE approach has problems with clustered data

I propose

1. decide that D should scale with N if the |correlation| > an arbitrary
limit [0.99 suggested to ensure only highly certain correlations are
taken] (and only if we have not already decided that D > d). This alone
would "fix" the estimate of D for all of the cases highlighted in my
test case, above. However, my test was both correlated and clustered -
and many real cases would be clustered only (say if people are reading
values from a sequence before insertion the correlation would not be so
near perfect).
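
As a sketch (the names here are mine, not ANALYZE internals), the rule
amounts to little more than:

def maybe_scale_with_n(d_estimate, d_in_sample, correlation, N, limit=0.99):
    # If the column is almost perfectly correlated with physical order and we
    # have not already concluded that D exceeds the distinct count seen in the
    # sample, record the estimate as a fraction of N so it scales with table size.
    if abs(correlation) > limit and d_estimate <= d_in_sample:
        return -(d_estimate / N)     # negative = "scales with N", as today
    return d_estimate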

2. When
i) we have a large table (> 1,000,000 rows if known, or > 50,000 * max
stats target blocks), and
ii) we have ample additional memory to collect a larger sample size,

then make the following relatively minor changes to the current
approach.

a). we collect a sample that consists of both block and row sampled
data. Any estimator applied to this data will then naturally be a hybrid
estimator. This would be done in such a way that no new I/O would be
incurred, i.e. we would block sample *some* of the blocks which we will
read for row sampling purposes. Doing that will increase our knowledge
of clustered data, as well as lifting the sample size as is required to
maintain the accuracy of histogram calculation.

This means we will still use a fixed size sample, but the sample size
will be larger according to maintenance_work_mem. The sample will still
fit in memory, so the required sorts will still be fast memory sorts.
Also, the sample size becomes larger without increasing the statistics
targets for particular columns, and this happens automatically when
maintenance_work_mem is set higher.

e.g. If we estimate memory usage as 1000 * n, we can estimate how much
memory is available for additional block-related sampling. If we block
sample 10% of all blocks selected for row samples, then this will use at
most BLCKSZ * n /10 additional memory. 
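
A back-of-envelope version of that memory rule, purely illustrative (the
1000-bytes-per-sampled-row figure is the estimate above; the function
and constants are mine):

BLCKSZ = 8192

def hybrid_sample_memory(stats_target=10):
    n = 300 * stats_target              # row-sample size (current rule)
    row_sample_mem = 1000 * n
    block_sample_mem = BLCKSZ * n // 10 # 10% of the sampled blocks read whole
    return row_sample_mem + block_sample_mem

for target in (10, 100, 1000):
    print(target, hybrid_sample_memory(target) // (1024 * 1024), "MB or so")
# ~5 MB at the default target, ~52 MB at 100, ~520 MB at 1000: easy to fit
# under maintenance_work_mem on the machines that need larger samples.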

I'm not suggesting (yet) we follow the suggestion of Chaudhuri et al for
adaptive page/block sampling, since we do not actually compare values
retrieved until we sort them later. That would require more involved
changes to the current ANALYZE mechanisms. We can look at the row
lengths to ensure that the block sampled rows do not completely swamp
the row sampled ones.

b). look at individual block clustering using the tupnoLink and the
tuple TIDs. If the tuple TID == tupnoLink[tuple]->TID then they are from
the same block. This will allow us to spot block clustered data even
when the data is not highly correlated because of the effects of
deletion and insertion on steady state tables. If clustering appears to
exist, apply the best block estimator from Brutlag & Richardson [2000];
if not, stick with Haas/Stokes.

3. We should also apply multi-column heuristics to the estimation of D,
once we have estimated all columns. For column groups (pairs, triples
etc) that form part of a PK, we know that it must be true that D1 *
D2 ... Dk >= N. In many cases we will be very confident of our estimate
of D when we decide D = d. i.e. when we have two columns, we can use
this to infer that D1 >= N/d when D2 = d. So we can do this in any case
where we have confident estimates of all but one column; the required
information is available at that time.
e.g. if line_item primary key ( l_orderkey, l_linenumber ) and we know
that there are at most 10 l_linenumber values in the table, then there
should be N/10 values for l_orderkey, so set it to that if it is lower
(only).
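
A minimal sketch of that heuristic (the function name and shape are
mine): raise the one uncertain column's estimate to the lower bound the
primary key implies.

import math

def apply_pk_lower_bound(d_estimates, uncertain_col, N):
    known = math.prod(d for col, d in d_estimates.items()
                      if col != uncertain_col)
    bound = math.ceil(N / known)        # D1 * D2 * ... * Dk >= N
    d_estimates[uncertain_col] = max(d_estimates[uncertain_col], bound)
    return d_estimates

# The lineitem example: l_linenumber confidently has ~10 distinct values,
# so l_orderkey cannot have fewer than N/10.
print(apply_pk_lower_bound({"l_orderkey": 100_000, "l_linenumber": 10},
                           "l_orderkey", N=6_000_000))
# {'l_orderkey': 600000, 'l_linenumber': 10}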

4. There are a number of other heuristics we could use but those become
more specialised and complex as we proceed down that path.

5. We should implement hash-table overflow logic for HashAggregates just
as has been done for HashJoins. Overflow logic merely copes with the
problem, though, which could still mean that some queries run for very
extended durations. We should also emit a log message advising when the
overflow logic subdivides the hash table more than twice (i.e. we have
underestimated D by more than 4x), so that admins can take action and/or
-hackers gets to find out when badness occurs somewhere.

6. A set of functions to override n_distinct when we wish to do this,
allowing database designers to set some of these values by definition,
rather than leaving ANALYZE to discover them. This would require
another column on pg_statistic so that the override value is never
overridden itself by subsequent ANALYZEs.
=================================================================





Re: Improving N-Distinct estimation by ANALYZE

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> [ ... a large amount of analysis based on exactly one test case ... ]

I think you are putting too much emphasis on fixing one case and not
enough on considering what may happen in other cases ...

In general, estimating n-distinct from a sample is just plain a hard
problem, and it's probably foolish to suppose we'll ever be able to
do it robustly.  What we need is to minimize the impact when we get
it wrong.  So I agree with the comment that we need to finish the
unfinished project of making HashAggregate tables expansible, but
I'm dubious about the rest of this.
        regards, tom lane


Re: Improving N-Distinct estimation by ANALYZE

From
Josh Berkus
Date:
Tom,

> In general, estimating n-distinct from a sample is just plain a hard
> problem, and it's probably foolish to suppose we'll ever be able to
> do it robustly.  What we need is to minimize the impact when we get
> it wrong.  

Well, I think it's pretty well proven that to be accurate at all you need 
to be able to sample at least 5%, even if some users choose to sample 
less.   Also I don't think anyone on this list disputes that the current 
algorithm is very inaccurate for large tables.  Or do they?

While I don't think that we can estimate N-distinct completely accurately, 
I do think that we can get within +/- 5x for 80-90% of all cases, instead 
of 40-50% of cases like now.  We can't be perfectly accurate, but we can 
be *more* accurate.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: Improving N-Distinct estimation by ANALYZE

From
"Jim C. Nasby"
Date:
On Wed, Jan 04, 2006 at 07:10:29PM +0000, Simon Riggs wrote:
> 3. We should also apply multi-column heuristics to the estimation of D,
> once we have estimated all columns. For column groups (pairs, triples
> etc) that form part of a PK, we know that it must be true that D1 *
> D2 ... Dk >= N. In many cases we will be very confident of our estimate
> of D when we decide = d. i.e. When we have two columns, we can use this
> to infer that D1 = N/d when D2 = d. So we can do this in any case where
> we have confident estimates of all but one column; the required
> information is available at that time.
> e.g. if line_item primary key ( l_orderkey, l_linenumber ) and we know
> that there are at most 10 l_linenumber values in the table, then there
> should be N/10 values for l_orderkey, so set it to that if it is lower
> (only).

Sorry if I'm pointing out the obvious, but I would do this for any
unique index, not just a PK. (It should still hold for any unique index,
right?)

Also, was an approach of sampling random rows within random blocks
considered? Something like:

until row sample size reached
    read random block
    sample x% of rows in that block randomly
done
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Improving N-Distinct estimation by ANALYZE

From
Josh Berkus
Date:
Simon,

> - Are there any performance issues that can be directly attributed to
> mis-estimation of N-Distinct ("D") by the ANALYZE command?

Yes.   There's at least one query (maybe two) from TPC-H which bombs
because of bad N-distinct estimation, even with stats_target =1000.  Based
on my experience with data warehouses, this also occurs in the field.

> - If so, can we do better than we currently achieve? How?

Replace the current algorithm and broaden the sample.

> - Would altering the estimate of D cause problems in other places?

Unlike Index Cost Estimation, I wouldn't expect it to.  We make pretty
"proper" use of D right now, it's just that for some common cases our
estimates of D are bad.

> The estimation of D is difficult and imprecise. The current method works
> well in many cases, yet breaks down *badly* in one particular very
> common use case of database design: a large dependent table with a
> multi-column Primary Key. In some cases the estimate of D *decreases* as
> the size of the table increases, though the estimate of D is an
> underestimate in almost all cases, whatever the table design.

Actually, the current estimator underestimates D for *any* large table's
high-cardinality columns, primary key, multi-column, or not.
Chaudhuri's calculation seems to be designed to yield the smallest
number of cardinal values that could reasonably be expected to yield the
provided sample.   That is, if the estimate range within a stdev of 2.0
gives from 50,000 to 500,000 distinct values, Chaudhuri picks 50,000.

This conservative approach makes sense when you're only considering join
strategies.  That is, given an unreliable estimate you want to estimate
D low so that you don't wrongly choose a nested loop, the cost for which
mistake being much higher than the cost of performing an unnecessary
hash join.   It's "conservative" in that sense.

However,   PostgreSQL now has a whole set of hash operations and other
query types for which a
too-low estimate of D causes query lockup.   So for these operations,
Chaudhuri ceases to be conservative and becomes high-risk.   FWIW, my
testing with TPCH showed that estimate error is usually OK within +/-
5x.  Beyond that you start to get bad query plans.

(yes, I know all of the above begs examples.  I'm swamped.   I believe I
posted examples when I first started talking about n-distinct estimation a
year ago)

So I think it's vital that we look at algorithms designed to deliver us
the median estimated D, not the lowest reasonable, in addition to
increasing sample size.  The block-based estimator functions which
Andrew and I looked at seem designed to do that provided a sample of
between 1% and 10%.

> 1. *All* methods of statistical analysis are improved by larger sample
> fractions. The D estimator method currently in use shows an optimum of
> accuracy and sample fraction at around 5% of a table, as shown in the
> author's original paper [Haas Stokes (1998)]. The current
> implementation's error rates climb higher as table size increases.

I read 5 different papers on ACM about sampling D.  All of them were
united in saying that you couldn't get even slightly accurate estimates
with less than 3% sampling.

> 2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
> row sampling technique rather than a block sampling technique. This
> would translate directly into a performance drop from large sample
> ratios, but since we currently use a fixed sample size this problem is
> not yet visible for larger tables. With a 2GB table, we would typically
> sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.

This would be a reason to use block-sampling ONLY, rather than hybrid
sampling.

> 3. Large values of statistics target (up to 1000) could cause a number
> of problems with statistics catalog table growth (mentioned on
> -perform). Setting these values appropriately can take significant
> effort. Automatic scaling of such parameters is desirable.

Well, decoupling the n-distinct sample size from the # of heuristics rows
would fix that.

> 4. ANALYZE doesn't use more memory if maintenance_work_mem is set high,
> nor does it use less if it is set low. Actual memory usage isn't
> measured at all. With very long rows this is max 24 MB with default
> stats target settings and BLCKSZ, or 2.4 GB with highest stats target
> (1000). This probably is of lower importance since stats targets are
> only usually set higher on larger databases, which typically have larger
> memory configurations anyway - and very long rows are uncommon because
> of TOAST. Typical memory usage by ANALYZE would be < 1 MB with default
> settings i.e. maybe as low as a 0.01% sample for a very large table.

Yeah, I think I pointed out that ANALYZE doesn't seem to be using
maintenance_mem a year ago.   It should, although there should be options
to use less than maintenance_mem if the user wants low-impact analyze.

> There is a further effect of concern here. We currently apply a "lift"
> to the D estimate when we think that the number of distinct values is
> high enough to imply that it should be scaled according to N. In the
> above example, when sample ratio is too small, D will hit the point
> where it is too low to be scaled and we suddenly "bomb out" to a much
> lower value.

Yes, this is another problem with the current algorithm.  This kind of
"thresholding" is, well, hackish.  More importantly, it leads to
unpredictable query behavior as a query on a table only 10 rows larger
yields a radically different plan at the edge of the threshold.

> The test results don't seem too bad if you view the estimate of D as at
> most a factor of 10 wrong. However, since the error scales up with the
> size of the table, we can imagine very large estimation errors.

Yes.  My tests showed that for a tpch of 100G, with 600 million rows in
Lineitem, D was an average of 30x low and could not be less than 10x low
even with the luckiest sample.  This misestimate gets worse as the table
gets larger.

> Chaudhuri's estimator is based on a least risk approach, rather than a
> greatest accuracy approach, which does sound appealing should we not be
> able to apply an improved estimator.

As I point out above, though, Chaudhuri's understanding of "least risk"
is flawed.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
On Wed, 2006-01-04 at 14:49 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > [ ... a large amount of analysis based on exactly one test case ... ]

[Hmmm, those are your opinions, not my words. Funny guy ;-) ]

The one test case just happens to be a very common 1:M relationship, an
example of which occurs within the TPC-H/DBT-3 test database. I picked
that so it was an obvious publicly accessible test case, rather than an
isolated customer issue that couldn't be aired fully. I was hoping to
allow the problem to be explored and improved.

> I think you are putting too much emphasis on fixing one case and not
> enough on considering what may happen in other cases ...

It's not just one problem, but knowing that you would accept only a
detailed analysis, I researched that one case so that it was
indisputable. 

> I'm dubious about the rest of this.

Excellent. Much better than "It's Wrong." I'll write some code and run some tests.

Thanks for reading it all; sorry it had to be so long.

Best Regards, Simon Riggs




Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> Tom,
> 
> > In general, estimating n-distinct from a sample is just plain a hard
> > problem, and it's probably foolish to suppose we'll ever be able to
> > do it robustly.  What we need is to minimize the impact when we get
> > it wrong.  
> 
> Well, I think it's pretty well proven that to be accurate at all you need 
> to be able to sample at least 5%, even if some users choose to sample 
> less.   Also I don't think anyone on this list disputes that the current 
> algorithm is very inaccurate for large tables.  Or do they?

I think it's worse than that. It's proven that to get any kind of accuracy in
this kind of estimate you basically have to look at the entire table.

Someone posted a URL to a paper that had a very clever way of estimating
distinct values. It required scanning the entire table but only kept a sample
of the values found using a method that guaranteed the sample was
representative not of the entire table but of the distinct values.

I think you're right that a reasonable sample size for this kind of estimate
is going to be proportional to the table size, not the constant sized sample
that regular statistics need. However that same paper has some preliminary
verbiage about how previous papers had found that the sample sizes needed were
unacceptably large. Something like 90%.

Unfortunately I can't find the reference to the paper this moment. I'll look
again later.

I think it would be interesting to get the algorithm for sampling from that
paper in. It would only be able to be used on a VACUUM ANALYZE but currently
VACUUMs are necessary anyways. I do wonder what would happen when we get the
incremental VACUUMs and the bitmaps to avoid vacuuming pages unnecessarily
though. Then it would be less useful.

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
On Wed, 2006-01-04 at 17:57 -0600, Jim C. Nasby wrote:
> On Wed, Jan 04, 2006 at 07:10:29PM +0000, Simon Riggs wrote:
> > 3. We should also apply multi-column heuristics to the estimation of D,
> > once we have estimated all columns. For column groups (pairs, triples
> > etc) that form part of a PK, we know that it must be true that D1 *
> > D2 ... Dk >= N. In many cases we will be very confident of our estimate
> > of D when we decide = d. i.e. When we have two columns, we can use this
> > to infer that D1 = N/d when D2 = d. So we can do this in any case where
> > we have confident estimates of all but one column; the required
> > information is available at that time.
> > e.g. if line_item primary key ( l_orderkey, l_linenumber ) and we know
> > that there are at most 10 l_linenumber values in the table, then there
> > should be N/10 values for l_orderkey, so set it to that if it is lower
> > (only).
> 
> Sorry if I'm pointing out the obwious, but I would do this for any 
> unique index, not just a PK. (It should still hold for any unique index,
> right?)

Yes. It's just a less common case to have more than one unique index.

> Also, was an approach of sampling random rows within random blocks
> considered? Something like:
> 
> until row sample size reached
>     read random block
>     sample x% of rows in that block randomly
> done

Yes. 

I was trying to maintain the existing approach as much as possible,
rather than ripping everything out and starting again which could cause
just as many problems as it solves. So evolution, rather than
revolution.

The approach I suggested uses the existing technique for selecting
random blocks, then either an exhaustive check on all of the rows in a
block or the existing random row approach, depending upon available
memory. We need to check all of the rows in a reasonable sample of
blocks otherwise we might miss clusters of rows in large tables - which
is the source of the problems identified.

The other reason was to increase the sample size, which is a win in any
form of statistics.

Best Regards, Simon Riggs



Re: Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote:
> I think you're right that a reasonable sample size for this kind of
> estimate
> is going to be proportional to the table size, not the constant sized
> sample
> that regular statistics need. 

Agreed [I said exactly that in April]; the counter argument at that time
was that proportional samples on large tables lead to external sorts in
many cases which would lead to unacceptably long run times - since we
need to sort the values for each attribute in turn.

I've proposed limiting ourselves to maintenance_work_mem (I credit Josh
with that idea from April). If you can allocate 1 GB of memory to an
ANALYZE then this would be a very large proportion of a medium sized
table (or partition).

Considering how few rows we sample at the moment, increasing the actual
sample size by many 000s would have a very beneficial effect.

> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good. 
> > 
> >  http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> > 
> > It does require a single full table scan 

The Distinct Sampling approach you mention would scan the whole table
and also use large in-memory hash tables. So it uses more I/O, the same
memory and probably less CPU - no sorting required. The technique is the
same implementation as a HashAgg, just one that loses rows in a
predictable manner when it spills out of memory. It doesn't identify
columns that scale with N, nor does it calculate correlation.

That's the same as re-writing Count(Distinct) to use hashing, which is a
TODO item. So perhaps you could plan the code to do the Distinct
Sampling approach at the same time. Hmmm. I'll think about that.

Best Regards, Simon Riggs



Re: Improving N-Distinct estimation by ANALYZE

From
Trent Shipley
Date:
Sorry to interrupt.  The discussion is interesting, but I need some help to
follow along.

On Wednesday 2006-01-04 17:07, Josh Berkus wrote:
> Simon,
>
> > - Are there any performance issues that can be directly attributed to
> > mis-estimation of N-Distinct ("D") by the ANALYZE command?
>
> Yes.   There's at least one query (maybe two) from TPC-H which bombs
> because of bad N-distinct estimation, even with stats_target =1000.  Based
> on my experience with data warehouses, this also occurs in the field.
>
> > - If so, can we do better than we currently achieve? How?
>
> Replace the current algorithm and broaden the sample.

Is "replace the algorithm" the same as saying "contextually use some estimate
of D that is not Chaudhuri?

> > - Would altering the estimate of D cause problems in other places?
>
> Unlike Index Cost Estimation, I wouldn't expect it to.  We make pretty
> "proper" use of D right now, it's just that for some common cases our
> estimates of D are bad.
>
> > The estimation of D is difficult and imprecise. The current method works
> > well in many cases, yet breaks down *badly* in one particular very
> > common use case of database design: a large dependent table with a
> > multi-column Primary Key. In some cases the estimate of D *decreases* as
> > the size of the table increases, though the estimate of D is an
> > underestimate in almost all cases, whatever the table design.
>
> Actually, the current estimator underestimates D for *any* large table's
> high-cardinality columns, primary key, multi-column, or not.
> Chaudhuri's calculation seems to be designed to yield the smallest
> number of cardinal values that could reasonably be expected to yield the
> provided sample.   That is, if the estimate range within a stdev of 2.0
> gives from 50,000 to 500,000 distinct values, Chaudhuri picks 50,000.
>
> This conservative approach makes sense when you're only considering join
> strategies.  That is, given an unreliable estimate you want to estimate
> D low so that you don't wrongly choose a nested loop, the cost for which
> mistake being much higher than the cost of performing an unnecessary
> hash join.   It's "conservative" in that sense.

So Chaudhuri's estimate of D is appropriate (and is working) when making
decisions about joins.

> However,   PostgreSQL now has a whole set of hash operations and other
> query types for which a
> too-low estimate of D causes query lockup.

Why?

> So for these operations,
> Chaudhuri ceases to be conservative and becomes high-risk.   FWIW, my
> testing with TPCH showed that estimate error is usually OK within +/-
> 5x.  Beyond that you start to get bad query plans.
>
> (yes, I know all of the above begs examples.  I'm swamped.   I believe I
> posted examples when I first started talking about n-distinct estimation a
> year ago)
>
> So I think it's vital that we look at algorithms designed to deliver us
> the median estimated D, not the lowest reasonable, in addition to
> increasing sample size.  The block-based estimator functions which
> Andrew and I looked at seem designed to do that provided a sample of
> between 1% and 10%.

Do you *really* want the median estimate in these cases?  Are you certain you
do not want something with the opposite behavior of Chaudhuri's estimate so
that for small sample sizes the bias is toward a high estimate of D?
(Converges on D from the right instead of the left.)

Chaudhuri's <-----D------------------> needed
Estimate                               estimate

> > 1. *All* methods of statistical analysis are improved by larger sample
> > fractions. The D estimator method currently in use shows an optimum of
> > accuracy and sample fraction at around 5% of a table, as shown in the
> > author's original paper [Haas Stokes (1998)]. The current
> > implementation's error rates climb higher as table size increases.
>
> I read 5 different papers on ACM about sampling D.  All of them were
> united in saying that you couldn't get even slightly accurate estimates
> with less than 3% sampling.

These statements are at odds with my admittedly basic understanding of
statistics.  Isn't the power of a sample more related to the absolute size of
the sample than the sample as fraction of the population?  Why not just pick
a smallish sample size, say about 3000, and apply it to all the tables, even
the ones with just a single row (modify appropriately from block sampling).

> > 2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
> > row sampling technique rather than a block sampling technique. This
> > would translate directly into a performance drop from large sample
> > ratios, but since we currently use a fixed sample size this problem is
> > not yet visible for larger tables. With a 2GB table, we would typically
> > sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.
>
> This woudl be a reason to use block-sampling ONLY, rather than hybrid
> sampling.
>
> > 3. Large values of statistics target (up to 1000) could cause a number
> > of problems with statistics catalog table growth (mentioned on
> > -perform). Setting these values appropriately can take significant
> > effort. Automatic scaling of such parameters is desirable.
>
> Well, decoupling the n-distinct sample size from the # of heuristics rows
> would fix that.


Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Simon Riggs <simon@2ndquadrant.com> writes:

> The approach I suggested uses the existing technique for selecting
> random blocks, then either an exhaustive check on all of the rows in a
> block or the existing random row approach, depending upon available
> memory. We need to check all of the rows in a reasonable sample of
> blocks otherwise we might miss clusters of rows in large tables - which
> is the source of the problems identified.
> 
> The other reason was to increase the sample size, which is a win in any
> form of statistics.

Only if your sample is random and independent. The existing mechanism tries
fairly hard to ensure that every record has an equal chance of being selected.
If you read the entire block and not appropriate samples then you'll introduce
systematic sampling errors. For example, if you read an entire block you'll be
biasing towards smaller records.

I think it would be useful to have a knob to increase the sample size
separately from the knob for the amount of data retained in the statistics
tables. Though I think you'll be disappointed and find you have to read an
unreasonably large sample out of the table before you get more useful distinct
estimates.

Certainly it's worth testing this in a low impact way like just keeping the
existing sample method and dialing up the sample sizes before you try anything
that would sacrifice the statistical validity of the more solid estimates.

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
Josh Berkus
Date:
Trent,

> Sorry to interupt.  The discussion is interesting, but I need some help to 
> follow along.

Thought-out commentary is welcome.


> Is "replace the algorithm" the same as saying "contextually use some estimate 
> of D that is not Chaudhuri?

Yes.  I favor a block-based approach like Brutlag, largely because it 
allows us to increase the sample size without dramatically increasing I/O.

> So Chaudhuri's estimate of D is appropriate (and is working) when making 
> decisions about joins.

Some kinds of joins.   It avoids, for example, risky use of nested loops.

>>However,   PostgreSQL now has a whole set of hash operations and other
>>query types for which a
>>too-low estimate of D causes query lockup.
> 
> 
> Why?  

Two specific examples, both of which I've encountered in the field:

1) too-low D will cause an aggregate query to use a hashagg which is
larger than memory, resulting in swapping (or disk spill when it's
fixed), which makes the query very much slower than if the hashagg were
not used.

2) much too-low D will cause the planner to pick a seq scan when it's 
not necessary, resulting in increased query times.


> Do you *really* want the median estimate in these case?  Are you certain you 
> do not want something with the opposite behavior of Chaudhuri's estimate so 
> that for small sample sizes the bias is toward a high estimate of D? 
> (Converges on D from the right instead of the left.)
> 
> Chaudhuri's <-----D------------------> needed
> Estimate                               estimate

Hmmm.  Yeah, I see what you mean.  True, the ideal approach would be to
determine for each query operation whether a too-low D or a too-high D
was more risky, and then use the more conservative number.   However, 
that would complicate the query planner enough that I think Tom would 
leave us. :-p


> These statements are at odds with my admittedly basic understanding of 
> statistics.  Isn't the power of a sample more related to the absolute size of 
> the sample than the sample as fraction of the population?  Why not just pick 
> a smallish sample size, say about 3000, and apply it to all the tables, even 
> the ones with just a single row (modify appropriately from block sampling).

Nope, it's definitely proportional.   As a simple example, a sample of
500 rows in a table of 1000 rows should yield stats estimates with 90%+
accuracy.  But a sample of 500 rows in a 600,000,000 row table is so
small as to be nearly useless; it's quite possible to get all the same
value in a random sample of < 0.1% even on a column with a D/N of 0.001.
If you look at the papers cited, almost all researchers more recent
than Chaudhuri use a proportional sample size.

And we're taking the fixed-sample-size approach now, and it's not 
working very well for large tables.

--Josh Berkus


Re: Improving N-Distinct estimation by ANALYZE

From
Josh Berkus
Date:
Greg,

> Only if your sample is random and independent. The existing mechanism tries
> fairly hard to ensure that every record has an equal chance of being selected.
> If you read the entire block and not appropriate samples then you'll introduce
> systematic sampling errors. For example, if you read an entire block you'll be
> biasing towards smaller records.

Did you read any of the papers on block-based sampling?   These sorts of 
issues are specifically addressed in the algorithms.

--Josh


Re: Improving N-Distinct estimation by ANALYZE

From
Josh Berkus
Date:
Folks,

> Nope, it's definitely proportional.   As a simple example, a sample of 
> 500 rows in a table of 1000 rows should yield stats estimates with 90%+ 
> accuracy.  But a sample of 500 rows in a 600,000,000 row table is so 
> small as to be nearly useless; it's quite possible to get all the same 
> value in a random sample of < 0.1% even on a column with a D/N of 0.001. 

I meant "a D/N of 0.1".  Sorry.

--Josh


Re: Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
On Thu, 2006-01-05 at 00:33 -0500, Greg Stark wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> 
> > The approach I suggested uses the existing technique for selecting
> > random blocks, then either an exhaustive check on all of the rows in a
> > block or the existing random row approach, depending upon available
> > memory. We need to check all of the rows in a reasonable sample of
> > blocks otherwise we might miss clusters of rows in large tables - which
> > is the source of the problems identified.
> > 
> > The other reason was to increase the sample size, which is a win in any
> > form of statistics.
> 
> Only if your sample is random and independent. The existing mechanism tries
> fairly hard to ensure that every record has an equal chance of being selected.
> If you read the entire block and not appropriate samples then you'll introduce
> systematic sampling errors. For example, if you read an entire block you'll be
> biasing towards smaller records.

Yes, I discussed that, following Brutlag & Richardson [2000]. The bottom
line is if there is no clustering, block sampling is random, which is
good; if there is clustering, then you spot it, which is good.

> I think it would be useful to have a knob to increase the sample size
> separately from the knob for the amount of data retained in the statistics
> tables. Though I think you'll be disappointed and find you have to read an
> unreasonably large sample out of the table before you get more useful distinct
> estimates.

OK, I'll look at doing that.

Best Regards, Simon Riggs



Re: Improving N-Distinct estimation by ANALYZE

From
Rod Taylor
Date:
> > Do you *really* want the median estimate in these cases?  Are you certain you 
> > do not want something with the opposite behavior of Chaudhuri's estimate so 
> > that for small sample sizes the bias is toward a high estimate of D? 
> > (Converges on D from the right instead of the left.)
> > 
> > Chaudhuri's <-----D------------------> needed
> > Estimate                               estimate
> 
> Hmmm.  Yeah, I see what you mean.  True, the ideal approach would be to 
> determine for each query operation whether a too-low D or a too-high D 
> was more risky, and then use the more conservative number.   However, 
> that would complicate the query planner enough that I think Tom would 
> leave us. :-p

You could have some specific functions vote themselves out if their cost
is shaky. We know that the cost of a miscalculated nestloop is huge, so
after calculating the common case it might apply a multiplier for the
"risk" involved.

There have been lots of requests for a way to achieve more consistent
plans that have a determined worst case performance, even if they never
perform as well in the best case as another algorithm might. Perhaps
this could be a GUC:

    PlanCost + PlanCost * Risk * RiskGUC

"Risk" is a number that indicates how badly things can go wrong.

"RiskGUC" is an integer multiplier. Someone who is risk averse (wants a
predictable execution time rather than the best possible time) would set
this value high. Others who want the best possible plan in most cases
even if it performs poorly once in a while will set the value very low,
possibly 0.
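
Purely by way of illustration (the numbers and names below are invented,
not planner internals), the effect of such a multiplier on plan choice
would look something like:

def effective_cost(plan_cost, risk, risk_guc):
    return plan_cost + plan_cost * risk * risk_guc

plans = {
    "nestloop": (900.0, 5.0),   # cheap if the D estimate holds, awful if not
    "hashjoin": (1200.0, 0.5),  # a little dearer, but degrades gracefully
}

for risk_guc in (0, 1, 4):      # 0 behaves like today's planner
    best = min(plans, key=lambda p: effective_cost(*plans[p], risk_guc))
    print(risk_guc, best)
# 0 nestloop / 1 hashjoin / 4 hashjoin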

-- 



Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> > Only if your sample is random and independent. The existing mechanism tries
> > fairly hard to ensure that every record has an equal chance of being selected.
> > If you read the entire block and not appropriate samples then you'll introduce
> > systematic sampling errors. For example, if you read an entire block you'll be
> > biasing towards smaller records.
> 
> Did you read any of the papers on block-based sampling?   These sorts of issues
> are specifically addressed in the algorithms.

We *currently* use a block based sampling algorithm that addresses this issue
by taking care to select rows within the selected blocks in an unbiased way.
You were proposing reading *all* the records from the selected blocks, which
throws away that feature.

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> > These statements are at odds with my admittedly basic understanding of
> > statistics.  Isn't the power of a sample more related to the absolute size of
> > the sample than the sample as fraction of the population?  Why not just pick
> > a smallish sample size, say about 3000, and apply it to all the tables, even
> > the ones with just a single row (modify appropriately from block sampling).
> 
> Nope, it's definitely proportional.   As a simple example, a sample of 500 rows
> in a table of 1000 rows should yield stats estimates with 90%+ accuracy.  But a
> sample of 500 rows in a 600,000,000 row table is so small as to be nearly
> useless; it's quite possible to get all the same value in a random sample of <
> 0.1% even on a column with a D/N of 0.001.   If you look at the papers cited,
> almost all researchers more recent than Chaudhuri use a proportional sample
> size.

To be clear Josh is talking specifically about the estimate of how many
distinct values a query will see. Not the more usual estimates of how many
records the query will see.

For estimating how many records a query like 
 SELECT * FROM tab WHERE x BETWEEN ? AND ?

the fixed size sample is on fairly solid ground. A sample of 600 gives (iirc)
+/- 2% 19 times out of 20. That's the same sample size most major opinion
polls use...

However this same logic doesn't work for estimating distinct values. Since a
single occurrence of a distinct value is just as important as hundreds of
occurrences, and your chances of finding that single occurrence are
proportional to the percentage of the overall table you sample, to maintain a
given accuracy you're going to have to maintain a sample that is a fixed
percentage of the overall table.
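
A small illustration of that point (ordinary sampling arithmetic,
nothing PostgreSQL-specific): the chance that a uniform n-row sample
ever sees a value that occurs k times in an N-row table is roughly
1 - (1 - n/N)^k, which shrinks in step with N.

def p_value_seen(n, N, k=1):
    return 1 - (1 - n / N) ** k

for N in (10**6, 10**8, 6 * 10**8):
    print(f"N={N:>11,}: {p_value_seen(3000, N):.2e}")
# 3.00e-03, 3.00e-05, 5.00e-06: with a fixed 3,000-row sample the chance of
# catching any given single-occurrence value falls by whatever factor the
# table grows, which is why a percentage sample is needed for this estimate.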

Worse, my recollection from the paper I mentioned earlier was that sampling
small percentages like 3-5% didn't get you an acceptable accuracy. Before you
got anything reliable you found you were sampling very large percentages of
the table. And note that if you have to sample anything over 10-20% you may as
well just read the whole table. Random access reads are that much slower.

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
Josh Berkus
Date:
Greg,

> We *currently* use a block based sampling algorithm that addresses this
> issue by taking care to select rows within the selected blocks in an
> unbiased way. You were proposing reading *all* the records from the
> selected blocks, which throws away that feature.

The block-based algorithms have specific math to cope with this.   Stuff 
which is better grounded in statistical analysis than our code.   Please 
read the papers before you judge the solution.

> Worse, my recollection from the paper I mentioned earlier was that
> sampling small percentages like 3-5% didn't get you an acceptable
> accuracy. Before you got anything reliable you found you were sampling
> very large percentages of the table. And note that if you have to sample
> anything over 10-20% you may as well just read the whole table. Random
> access reads are that much slower.

Right, which is why researchers are currently looking for something better.  
The Brutlag & Richardson paper claims to be able to produce estimates which are 
within +/- 3x 90% of the time using a 5% sample, which is far better than 
our current accuracy.  Nobody claims to be able to estimate based on < 
0.1% of the table, which is what Postgres tries to do for large tables.

5% based on block-based sampling is reasonable; that means a straight 5% of 
the on-disk size of the table, so 5gb for a 100gb table.  With random-row 
sampling, that would require as much as 25% of the table, making it easier 
to just scan the whole thing.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: Improving N-Distinct estimation by ANALYZE

From
"Jim C. Nasby"
Date:
On Thu, Jan 05, 2006 at 10:12:29AM -0500, Greg Stark wrote:
> Worse, my recollection from the paper I mentioned earlier was that sampling
> small percentages like 3-5% didn't get you an acceptable accuracy. Before you
> got anything reliable you found you were sampling very large percentages of
> the table. And note that if you have to sample anything over 10-20% you may as
> well just read the whole table. Random access reads are that much slower.

If I'm reading backend/commands/analyze.c right, the heap is accessed
linearly, only reading blocks that get selected but reading them in heap
order, which shouldn't be anywhere near as bad as random access.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> Greg,
> 
> > We *currently* use a block based sampling algorithm that addresses this
> > issue by taking care to select rows within the selected blocks in an
> > unbiased way. You were proposing reading *all* the records from the
> > selected blocks, which throws away that feature.
> 
> The block-based algorithms have specific math to cope with this.   Stuff 
> which is better grounded in statistical analysis than our code.   Please 
> read the papers before you judge the solution.

I'm confused since Postgres's current setup *was* based on papers on just this
topic. I haven't read any of the papers myself, I'm just repeating what was
previously discussed when the current block sampling code went in. There was
extensive discussion of the pros and cons of different algorithms taken from
various papers and how they related to Postgres. I don't recall any of them
suggesting that there was any way to take a sample which included every row
from a sampled block and then somehow unbias the sample after the fact.

> Right, which is why researchers are currently looking for something better.  
> The Brutlag & Richardson claims to be able to produce estimates which are 
> within +/- 3x 90% of the time using a 5% sample, which is far better than 
> our current accuracy.  Nobody claims to be able to estimate based on < 
> 0.1% of the table, which is what Postgres tries to do for large tables.
> 
> 5% based on block-based sampling is reasonable; that means a straight 5% of 
> the on-disk size of the table, so 5gb for a 100gb table.  With random-row 
> sampling, that would require as much as 25% of the table, making it easier 
> to just scan the whole thing.

Postgres's current sample sizes are clearly geared towards the histograms
where they are entirely realistic. All of the distinct estimates are clearly
just ad hoc attempts based on the existing sampling. 

Is a mechanism that is only 5x faster than reading the whole table (assuming
random_page_cost of 4) and is off by more than a factor of three 10% of the
time really worth it?

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
"Jim C. Nasby"
Date:
On Fri, Jan 06, 2006 at 01:24:41AM -0500, Greg Stark wrote:
> > 5% based on block-based sampling is reasonable; that means a straight 5% of 
> > the on-disk size of the table, so 5gb for a 100gb table.  With random-row 
> > sampling, that would require as much as 25% of the table, making it easier 
> > to just scan the whole thing.
> 
> Postgres's current sample sizes are clearly geared towards the histograms
> where they are entirely realistic. All of the distinct estimates are clearly
> just ad hoc attempts based on the existing sampling. 
> 
> Is a mechanism that is only 5x faster than reading the whole table (assuming
> random_page_cost of 4) and is off by more than a factor of three 10% of the
> time really worth it?

Before we start debating merits of proposals based on random reads, can
someone confirm that the sampling code actually does read randomly? I
looked at it yesterday; there is a comment that states that blocks to be
scanned are passed to the analyze function in physical order, and AFAICT
the function that chooses blocks does so based strictly on applying a
probability function to block numbers as it increments a counter. It
seems that any reading is actually sequential and not random, which
makes all the random_page_cost hand-waving null and void.
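
For concreteness, my reading is that the block chooser behaves like Knuth's
Algorithm S: it walks the block numbers in increasing order and accepts each
one with probability (blocks still needed)/(blocks remaining). A rough sketch
of that selection, not the actual analyze.c code:

#include <stdio.h>
#include <stdlib.h>

static void
sample_blocks(long nblocks, long targblocks, unsigned seed)
{
	long	needed = targblocks;
	long	remaining = nblocks;
	long	blk;

	srand(seed);
	for (blk = 0; blk < nblocks && needed > 0; blk++, remaining--)
	{
		/* accept this block with probability needed/remaining */
		if ((double) rand() / ((double) RAND_MAX + 1.0) * remaining < needed)
		{
			printf("%ld\n", blk);	/* blocks come out in increasing order */
			needed--;
		}
	}
}

int
main(void)
{
	sample_blocks(1000, 30, 42);
	return 0;
}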
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Improving N-Distinct estimation by ANALYZE

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Before we start debating merits of proposals based on random reads, can
> someone confirm that the sampling code actually does read randomly?

Well, it's not so much that it's not "random", as that it's not
sequential --- it skips blocks, and therefore you'd expect that
kernel-level read-ahead would not kick in, or at least not be very
effective.

If there weren't much else going on, you could still assume that
you'd be paying less seek cost than in a genuinely random-order
fetching of the same number of blocks.

Not sure how these effects would add up.  I agree that some
investigation would be wise before making any claims about how
expensive the current method actually is.
        regards, tom lane


Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:

> Before we start debating merits of proposals based on random reads, can
> someone confirm that the sampling code actually does read randomly? I
> looked at it yesterday; there is a comment that states that blocks to be
> scanned are passed to the analyze function in physical order, and AFAICT
> the function that chooses blocks does so based strictly on applying a
> probability function to block numbers as it increments a counter. It
> seems that any reading is actually sequential and not random, which
> makes all the random_page_cost hand-waving null and void.

Hm. I'm curious just how much that behaves like a sequential scan actually. I
think I'll do some experiments.

Reading 1% (1267 read, 126733 skipped):      7748264us
Reading 2% (2609 read, 125391 skipped):     12672025us
Reading 5% (6502 read, 121498 skipped):     19005678us
Reading 5% (6246 read, 121754 skipped):     18509770us
Reading 10% (12975 read, 115025 skipped):     19305446us
Reading 20% (25716 read, 102284 skipped):     18147151us
Reading 50% (63656 read, 64344 skipped):     18089229us
Reading 100% (128000 read, 0 skipped):         18173003us

These numbers don't make much sense to me. It seems like 5% is about as slow
as reading the whole file which is even worse than I expected. I thought I was
being a bit pessimistic to think reading 5% would be as slow as reading 20% of
the table.

Anyone see anything wrong with my methodology?
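
For reference, the test amounts to something like this (a minimal sketch, not
the attached program itself): walk the file in physical order, read each 8k
block with probability p, lseek() over it otherwise, and time the whole pass.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
	char		buf[BLCKSZ];
	struct stat st;
	struct timeval t0, t1;
	long		i, nblocks, nread = 0, nskip = 0;
	double		pct;
	int			fd;

	if (argc < 3)
	{
		fprintf(stderr, "usage: %s file percent\n", argv[0]);
		return 1;
	}
	pct = atof(argv[2]) / 100.0;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0)
	{
		perror(argv[1]);
		return 1;
	}
	nblocks = st.st_size / BLCKSZ;

	srand(12345);				/* fixed seed so runs are comparable */
	gettimeofday(&t0, NULL);
	for (i = 0; i < nblocks; i++)
	{
		if ((double) rand() / ((double) RAND_MAX + 1.0) < pct)
		{
			if (read(fd, buf, BLCKSZ) != BLCKSZ)	/* read this block */
				break;
			nread++;
		}
		else
		{
			lseek(fd, BLCKSZ, SEEK_CUR);			/* skip this block */
			nskip++;
		}
	}
	gettimeofday(&t1, NULL);

	printf("%ld read, %ld skipped: %ldus\n", nread, nskip,
		   (long) (t1.tv_sec - t0.tv_sec) * 1000000L +
		   (t1.tv_usec - t0.tv_usec));
	close(fd);
	return 0;
}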



--
greg

Attachment

Re: Improving N-Distinct estimation by ANALYZE

From
Josh Berkus
Date:
Greg,

> These numbers don't make much sense to me. It seems like 5% is about as
> slow as reading the whole file which is even worse than I expected. I
> thought I was being a bit pessimistic to think reading 5% would be as
> slow as reading 20% of the table.

It's about what *I* expected.  Disk seeking is the bane of many access 
methods.

Anyway, since the proof is in the pudding, Simon and I will be working on 
some demo code for different sampling methods so that we can debate 
results rather than theory.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> > These numbers don't make much sense to me. It seems like 5% is about as
> > slow as reading the whole file which is even worse than I expected. I
> > thought I was being a bit pessimistic to think reading 5% would be as
> > slow as reading 20% of the table.
> 
> It's about what *I* expected.  Disk seeking is the bane of many access 
> methods.

Sure, but that bad? That means realistic random_page_cost values should be
something more like 20 rather than 4. And that's with seeks only going to
subsequent blocks in a single file, which one would expect to average less
than the half rotation that a random seek would average. That seems worse than
anyone expects.
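
(Roughly: the 5% run read 6400-odd blocks in about 19s, i.e. ~3ms per block
fetched, while the full scan did 128000 blocks in about 18.2s, ~0.14ms per
block -- a ratio of about 20.)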

> Anyway, since the proof is in the pudding, Simon and I will be working on 
> some demo code for different sampling methods so that we can debate 
> results rather than theory.

Note that if these numbers are realistic then there's no i/o benefit to any
sampling method that requires anything like 5% of the entire table and is
still unreliable. Instead it makes more sense to implement an algorithm that
requires a full table scan and can produce good results more reliably.

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
Kenneth Marshall
Date:
On Fri, Jan 06, 2006 at 06:36:52PM -0500, Greg Stark wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> 
> > > These numbers don't make much sense to me. It seems like 5% is about as
> > > slow as reading the whole file which is even worse than I expected. I
> > > thought I was being a bit pessimistic to think reading 5% would be as
> > > slow as reading 20% of the table.
> > 
> > It's about what *I* expected.  Disk seeking is the bane of many access 
> > methods.
> 
> Sure, but that bad? That means realistic random_page_cost values should be
> something more like 20 rather than 4. And that's with seeks only going to
> subsequent blocks in a single file, which one would expect to average less
> than the half rotation that a random seek would average. That seems worse than
> anyone expects.
> 
> > Anyway, since the proof is in the pudding, Simon and I will be working on 
> > some demo code for different sampling methods so that we can debate 
> > results rather than theory.
> 
> Note that if these numbers are realistic then there's no i/o benefit to any
> sampling method that requires anything like 5% of the entire table and is
> still unreliable. Instead it makes more sense to implement an algorithm that
> requires a full table scan and can produce good results more reliably.
> 
> -- 
> greg
> 

I have not looked closely at the sampling process in PostgreSQL, but given
that getting an accurate answer is expensive in terms of reads/seeks, would
it be possible to iteratively refine the estimate over time? Ideally we could
use the same disk blocks that are already being read to satisfy a query.
The initial estimate could be done with minimal impact and then we piggy-
back the I/O needed to refine the estimate on the normal query I/O. In the
limit, if the entire table were scanned we would end up with an exact count
of the distinct values, and otherwise with a progressively more accurate
approximation as queries are run. This would let us hide the I/O in other
normal block accesses.

Ken


Re: Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
On Fri, 2006-01-06 at 16:13 -0500, Greg Stark wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> 
> > Before we start debating merits of proposals based on random reads, can
> > someone confirm that the sampling code actually does read randomly? I
> > looked at it yesterday; there is a comment that states that blocks to be
> > scanned are passed to the analyze function in physical order, and AFAICT
> > the function that chooses blocks does so based strictly on applying a
> > probability function to block numbers as it increments a counter. It
> > seems that any reading is actually sequential and not random, which
> > makes all the random_page_cost hand-waving null and void.
> 
> Hm. I'm curious just how much that behaves like a sequential scan actually. I
> think I'll do some experiments. 
> 
> Reading 1% (1267 read, 126733 skipped):      7748264us
> Reading 2% (2609 read, 125391 skipped):     12672025us
> Reading 5% (6502 read, 121498 skipped):     19005678us
> Reading 5% (6246 read, 121754 skipped):     18509770us
> Reading 10% (12975 read, 115025 skipped):     19305446us
> Reading 20% (25716 read, 102284 skipped):     18147151us
> Reading 50% (63656 read, 64344 skipped):     18089229us
> Reading 100% (128000 read, 0 skipped):         18173003us
> 
> These numbers don't make much sense to me. It seems like 5% is about as slow
> as reading the whole file which is even worse than I expected. I thought I was
> being a bit pessimistic to think reading 5% would be as slow as reading 20% of
> the table.

Just to put a few rumours to bed:

- the current code does *not* use block sampling, it uses random row
sampling. (There is a part of the code that selects the "next block" but
that should not confuse you into thinking that the whole block is
sampled).

- yes, the random sampling is random - please read the code and comments

- yes, I would expect the results you get. If you sample 5% of rows and
each block has on average at least 20 rows, then we should expect the
majority of blocks to be hit.
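
(For example, treating the row selections as roughly independent: with 20
rows per block, a 5% row sample leaves a given block completely untouched
with probability of only about 0.95^20 ~= 0.36, so roughly two thirds of the
blocks get read anyway.)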

Best Regards, Simon Riggs



Re: Improving N-Distinct estimation by ANALYZE

From
Lukas Smith
Date:
Simon Riggs wrote:

> - yes, the random sampling is random - please read the code and comments
> 
> - yes, I would expect the results you get. If you sample 5% of rows and
> each block has on average at least 20 rows, then we should expect the
> majority of blocks to be hit.

and it seems from the benchmark posted to this list that random access is 
_very_ expensive (probably because the random reads are spread out so widely 
that we do a lot of physical I/O instead of just logical I/O from some cache)

regards,
Lukas


Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Simon Riggs <simon@2ndquadrant.com> writes:

> - yes, I would expect the results you get. If you sample 5% of rows and
> each block has on average at least 20 rows, then we should expect the
> majority of blocks to be hit.

These results are from my test program. 5% means 5% of 8k blocks from the test
file. In other words, reading a random 5% of the blocks from the test file in
sequential order but seeking over the skipped blocks is just as slow as
reading the entire file.

I feel like that can't be right but I can't find anything wrong with the
methodology.

An updated program is attached with which I got these results:

bash-3.00# for i in `seq 1 100` ; do umount /u6; mount /dev/sda1 /u6; ~stark/src/pg/a.out /u6/temp/small $i ; done
Reading 1% (1280/128000 blocks 1048576000 bytes) total time 7662706us MB/s 1.37 effective MB/s 136.84
Reading 2% (2560/128000 blocks 1048576000 bytes) total time 12495106us MB/s 1.68 effective MB/s 83.92
Reading 3% (3840/128000 blocks 1048576000 bytes) total time 15847342us MB/s 1.99 effective MB/s 66.17
Reading 4% (5120/128000 blocks 1048576000 bytes) total time 18281244us MB/s 2.29 effective MB/s 57.36
Reading 5% (6400/128000 blocks 1048576000 bytes) total time 18988843us MB/s 2.76 effective MB/s 55.22
Reading 6% (7680/128000 blocks 1048576000 bytes) total time 19225394us MB/s 3.27 effective MB/s 54.54
Reading 7% (8960/128000 blocks 1048576000 bytes) total time 19462241us MB/s 3.77 effective MB/s 53.88
Reading 8% (10240/128000 blocks 1048576000 bytes) total time 19747881us MB/s 4.25 effective MB/s 53.10
Reading 9% (11520/128000 blocks 1048576000 bytes) total time 19451411us MB/s 4.85 effective MB/s 53.91
Reading 10% (12800/128000 blocks 1048576000 bytes) total time 19546511us MB/s 5.36 effective MB/s 53.65
Reading 11% (14080/128000 blocks 1048576000 bytes) total time 18989375us MB/s 6.07 effective MB/s 55.22
Reading 12% (15360/128000 blocks 1048576000 bytes) total time 18722848us MB/s 6.72 effective MB/s 56.01
Reading 13% (16640/128000 blocks 1048576000 bytes) total time 18621588us MB/s 7.32 effective MB/s 56.31
Reading 14% (17920/128000 blocks 1048576000 bytes) total time 18581751us MB/s 7.90 effective MB/s 56.43
Reading 15% (19200/128000 blocks 1048576000 bytes) total time 18422160us MB/s 8.54 effective MB/s 56.92
Reading 16% (20480/128000 blocks 1048576000 bytes) total time 18148012us MB/s 9.24 effective MB/s 57.78
Reading 17% (21760/128000 blocks 1048576000 bytes) total time 18147779us MB/s 9.82 effective MB/s 57.78
Reading 18% (23040/128000 blocks 1048576000 bytes) total time 18023256us MB/s 10.47 effective MB/s 58.18
Reading 19% (24320/128000 blocks 1048576000 bytes) total time 18039846us MB/s 11.04 effective MB/s 58.13
Reading 20% (25600/128000 blocks 1048576000 bytes) total time 18081214us MB/s 11.60 effective MB/s 57.99
...



--
greg

Attachment

Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
> > These numbers don't make much sense to me. It seems like 5% is about as slow
> > as reading the whole file which is even worse than I expected. I thought I was
> > being a bit pessimistic to think reading 5% would be as slow as reading 20% of
> > the table.

I have a theory. My test program, like Postgres, is reading in 8k chunks.
Perhaps that's fooling Linux into thinking it's a sequential read and reading
in 32k chunks internally. That would effectively make a 25% scan a full table
scan. And a 5% scan would be a 20% scan which is about where I would have
expected the breakeven point to be.

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Greg Stark <gsstark@MIT.EDU> writes:

> > > These numbers don't make much sense to me. It seems like 5% is about as slow
> > > as reading the whole file which is even worse than I expected. I thought I was
> > > being a bit pessimistic to think reading 5% would be as slow as reading 20% of
> > > the table.
> 
> I have a theory. My test program, like Postgres, is reading in 8k chunks.
> Perhaps that's fooling Linux into thinking it's a sequential read and reading
> in 32k chunks internally. That would effectively make a 25% scan a full table
> scan. And a 5% scan would be a 20% scan which is about where I would have
> expected the breakeven point to be.

Well my theory was sort of half right. It has nothing to do with fooling Linux
into thinking it's a sequential read. Apparently this filesystem was created
with 32k blocks. I don't remember if that was intentional or if ext2/3 did it
automatically based on the size of the filesystem.

So it doesn't have wide-ranging implications for Postgres's default 8k block
size. But it is a good lesson about the importance of not using a larger
filesystem block than Postgres's block size. The net effect is that if the
filesystem block is N*8k then your random_page_cost goes up by a factor of N.
That could be devastating for OLTP performance.

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Greg Stark <gsstark@MIT.EDU> writes:

> Well my theory was sort of half right. It has nothing to do with fooling Linux
> into thinking it's a sequential read. Apparently this filesystem was created
> with 32k blocks. I don't remember if that was intentional or if ext2/3 did it
> automatically based on the size of the filesystem.
> 
> So it doesn't have wide-ranging implications for Postgres's default 8k block
> size. But it is a good lesson about the importance of not using a larger
> filesystem block than Postgres's block size. The net effect is that if the
> filesystem block is N*8k then your random_page_cost goes up by a factor of N.
> That could be devastating for OLTP performance.

Hm, apparently I spoke too soon. tune2fs says the block size is in fact 4k.
Yet the performance of the block reading test program with 4k or 8k blocks
behaves as if Linux is reading 32k blocks. And in fact when I run it with 32k
blocks I get the kind of behaviour we were expecting, where the breakeven
point is around 20%.

So it's not the 8k block reading that's fooling Linux into reading ahead 32k.
It seems 32k readahead is the default for Linux, or perhaps it's the
sequential access pattern that's triggering it.

I'm trying to test it with posix_fadvise() set to random access but I'm having
trouble compiling. Do I need a special #define to get posix_fadvise from
glibc?
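
(For the record, defining _XOPEN_SOURCE 600 -- or _POSIX_C_SOURCE 200112L --
before the includes appears to be what glibc wants; a minimal sketch:)

/* posix_fadvise() is hidden behind feature-test macros on glibc, so define
 * one of these before including any system header. */
#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
	int		fd, rc;

	if (argc < 2)
	{
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	/* advise random access; Linux takes this as a cue to drop readahead */
	rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	if (rc != 0)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
	/* ... block-reading test would go here ... */
	return 0;
}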

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
On Mon, 2006-01-09 at 22:08 -0500, Greg Stark wrote:

> So it's not the 8k block reading that's fooling Linux into reading ahead 32k.
> It seems 32k readahead is the default for Linux, or perhaps it's the
> sequential access pattern that's triggering it.

Nah, Linux 2.6 uses flexible readahead logic. It increases slowly when
you read sequentially, but halves the readahead if you do another access
type. Can't see how that would give an average readahead size of 32k.

Anyway.... this is one just reason for change...

Best Regards, Simon Riggs



Re: Improving N-Distinct estimation by ANALYZE

From
Greg Stark
Date:
Simon Riggs <simon@2ndquadrant.com> writes:

> On Mon, 2006-01-09 at 22:08 -0500, Greg Stark wrote:
> 
> > So it's not the 8k block reading that's fooling Linux into reading ahead 32k.
> > It seems 32k readahead is the default for Linux, or perhaps it's the
> > sequential access pattern that's triggering it.
> 
> Nah, Linux 2.6 uses flexible readahead logic. It increases slowly when
> you read sequentially, but halves the readahead if you do another access
> type. Can't see that would give an average readahead size of 32k.

I've actually read this code at one point in the past. IIRC the readahead is
capped at 32k, which I find interesting given the results. Since this is
testing sequential access patterns, perhaps what's happening is that the test
for readahead is too liberal.

All the numbers I'm getting are consistent with a 32k readahead. Even if I run
my program with a 4k block size I get performance equivalent to a full table
scan very quickly. If I use a 32k block size then the breakeven point is just over
20%.

I suppose what I really ought to do is make some pretty graphs.

-- 
greg



Re: Improving N-Distinct estimation by ANALYZE

From
"Jim C. Nasby"
Date:
FWIW, results from a FBSD6.0 box. Between every run I ran code to zero
800MB (out of 1G) of memory, which should have effectively flushed the
OS cache (it would just put the OS into swapping).

Reading 1% (1280/128000 blocks 1048576000 bytes) total time 6870595us
MB/s 1.53 effective MB/s 152.62
Reading 2% (2560/128000 blocks 1048576000 bytes) total time 13397833us
MB/s 1.57 effective MB/s 78.26
Reading 3% (3840/128000 blocks 1048576000 bytes) total time 16288218us
MB/s 1.93 effective MB/s 64.38
Reading 4% (5120/128000 blocks 1048576000 bytes) total time 17998334us
MB/s 2.33 effective MB/s 58.26
Reading 5% (6400/128000 blocks 1048576000 bytes) total time 22958791us
MB/s 2.28 effective MB/s 45.67
Reading 6% (7680/128000 blocks 1048576000 bytes) total time 21272511us
MB/s 2.96 effective MB/s 49.29
Reading 7% (8960/128000 blocks 1048576000 bytes) total time 21708130us
MB/s 3.38 effective MB/s 48.30
Reading 8% (10240/128000 blocks 1048576000 bytes) total time 23213377us
MB/s 3.61 effective MB/s 45.17
Reading 9% (11520/128000 blocks 1048576000 bytes) total time 33641027us
MB/s 2.81 effective MB/s 31.17
Reading 10% (12800/128000 blocks 1048576000 bytes) total time 22484234us
MB/s 4.66 effective MB/s 46.64
Reading 15% (19200/128000 blocks 1048576000 bytes) total time 23985072us
MB/s 6.56 effective MB/s 43.72
Reading 20% (25600/128000 blocks 1048576000 bytes) total time 26332888us
MB/s 7.96 effective MB/s 39.82
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
On Fri, 2006-01-06 at 15:26 -0800, Josh Berkus wrote:

> Anyway, since the proof is in the pudding, Simon and I will be working on
> some demo code for different sampling methods so that we can debate
> results rather than theory.

I enclose a patch for checking out block sampling. This is not
production-ready yet, but it works bug-free on cvstip. Code comments have
been fully updated to explain what's going on inside.

All you need to do is "set analyze_blocks=b" and ANALYZE will switch
over to using the block sampling method and will read all the rows in "b"
blocks. The sample size will also be limited by maintenance_work_mem.
(Memory limitations could be smarter). This de-couples the specification
of the sample size from the specification of the MCV/histogram size
(stats_target).
[Right now, I'm not suggesting that we have a GUC named this - it just
exists for testing. If/when we agree to allow block sampling, then we
can discuss how to invoke/specify it]

The stats calculations aren't touched - it still uses Haas-Stokes.
If you "set log_min_messages=DEBUG2" you'll also get more useful
explanations of what the variables are and what decisions it makes about
D for each column being analyzed.
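
[For reference, the Haas-Stokes variant in question is, as far as I can tell,
the "Duj1" form; a fragment showing the arithmetic, with n = sample rows,
N = table rows, d = distinct values in the sample and f1 = values seen
exactly once:]

/* Sketch only, not the analyze.c code itself */
static double
haas_stokes_duj1(double n, double N, double d, double f1)
{
	double	numer = n * d;
	double	denom = (n - f1) + f1 * n / N;

	return numer / denom;		/* estimate of D */
}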

This patch has two main effects:
- ANALYZE runs more than x10 faster to retrieve the same size sample
- you can specify much larger samples for bigger tables, without
increasing the stats targets

Generally, the larger samples will give better results for the
estimation. However, what is immediately apparent is that the
Haas-Stokes estimator actually gets even worse with block sampling in
the particular case I raised on-list. (Which is understandable, but not
desirable). ISTM this is a strike against Haas-Stokes, rather than a
strike against block sampling. So I'm looking at implementing the
Brutlag & Richardson estimator(s) that cope with
number-of-values-appearing in only one block. Not surprisingly that
means some non-trivial additional code to retrieve blockids for each
tuple and make decisions accordingly. I plan to use a similar technique
to the existing TupnoLink array to match blockids.

The B&R estimators should allow a fairly fast, small sample to be taken,
making this more usable for dynamic sampling during query planning (when
desirable, see recent -perform thread).

It's also worth mentioning that for datatypes that only have an "="
operator the performance of compute_minimal_stats is O(N^2) when values
are unique, so increasing sample size is a very bad idea in that case.
It may be possible to re-sample the sample, so that we get only one row
per block as with the current row sampling method. Another idea might be
just to abort the analysis when it looks fairly unique, rather than
churn through the whole sample.
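
[To make the O(N^2) concrete, the equality-only path has to do something like
the following -- schematic code, not the real compute_minimal_stats(): every
sampled value is compared against every value tracked so far, which is about
n^2/2 comparisons when the column is (nearly) unique.]

typedef struct
{
	int		value;				/* stand-in for a Datum */
	int		count;
} TrackItem;

static int
count_distinct_eq_only(const int *sample, int n, TrackItem *track)
{
	int		ntracked = 0;
	int		i, j;

	for (i = 0; i < n; i++)
	{
		for (j = 0; j < ntracked; j++)
		{
			if (track[j].value == sample[i])	/* "=" is all we have */
			{
				track[j].count++;
				break;
			}
		}
		if (j == ntracked)
		{
			/* unseen value: the failed scan above cost O(ntracked) */
			track[ntracked].value = sample[i];
			track[ntracked].count = 1;
			ntracked++;
		}
	}
	return ntracked;			/* d, the number of distinct values found */
}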

Best Regards, Simon Riggs


Attachment

Re: Improving N-Distinct estimation by ANALYZE

From
Josh Berkus
Date:
Simon,

> It's also worth mentioning that for datatypes that only have an "="
> operator the performance of compute_minimal_stats is O(N^2) when values
> are unique, so increasing sample size is a very bad idea in that case.
> It may be possible to re-sample the sample, so that we get only one row
> per block as with the current row sampling method. Another idea might be
> just to abort the analysis when it looks fairly unique, rather than
> churn through the whole sample.

I'd tend to do the latter.   If we haven't had a value repeat in 25 blocks, 
how likely is one to appear later?

Hmmm ... does ANALYZE check for UNIQUE constraints?   Most unique columns 
are going to have a constraint on them, in which case we don't need to sample 
them at all for N-distinct.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: Improving N-Distinct estimation by ANALYZE

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> It's also worth mentioning that for datatypes that only have an "="
>> operator the performance of compute_minimal_stats is O(N^2) when values
>> are unique, so increasing sample size is a very bad idea in that case.

> Hmmm ... does ANALYZE check for UNIQUE constraints?

Our only implementation of UNIQUE constraints is btree indexes, which
require more than an "=" operator, so this seems irrelevant.
        regards, tom lane


Re: Improving N-Distinct estimation by ANALYZE

From
"Jim C. Nasby"
Date:
On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> >> It's also worth mentioning that for datatypes that only have an "="
> >> operator the performance of compute_minimal_stats is O(N^2) when values
> >> are unique, so increasing sample size is a very bad idea in that case.
> 
> > Hmmm ... does ANALYZE check for UNIQUE constraints?
> 
> Our only implementation of UNIQUE constraints is btree indexes, which
> require more than an "=" operator, so this seems irrelevant.

IIRC, the point was that if we know a field has to be unique, there's no
sense in doing that part of the analysis on it; you'd only care about
correlation.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
On Mon, 2006-01-16 at 12:26 -0600, Jim C. Nasby wrote:
> On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> > >> It's also worth mentioning that for datatypes that only have an "="
> > >> operator the performance of compute_minimal_stats is O(N^2) when values
> > >> are unique, so increasing sample size is a very bad idea in that case.
> > 
> > > Hmmm ... does ANALYZE check for UNIQUE constraints?
> > 
> > Our only implementation of UNIQUE constraints is btree indexes, which
> > require more than an "=" operator, so this seems irrelevant.
> 
> IIRC, the point was that if we know a field has to be unique, there's no
> sense in doing that part of the analysis on it; you'd only care about
> correlation.

An interesting point: if we know the cardinality by definition there is
no need to ANALYZE for that column. An argument in favour of the
definitional rather than the discovery approach. As a designer I very
often already know the cardinality by definition, so I would appreciate
the opportunity to specify that to the database.

Tom has not spoken against checking for UNIQUE constraints: he is just
pointing out that there never could be a constraint in the case I was
identifying. For other datatypes we could check for them rather than
analyzing the sample, but the effect would be the same either way.

My original point was that the behaviour is O(N^2) when the column is unique;
it is still O(N^2) when it is nearly unique, albeit O(N^2) - O(N). Reducing
stats_target does not help there - the cost varies with the number of rows in
the sample, not the number of slots in the MCV list.

Best Regards, Simon Riggs





Re: Improving N-Distinct estimation by ANALYZE

From
Manfred Koizar
Date:
On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs
<simon@2ndquadrant.com> wrote:
>I enclose a patch for checking out block sampling.

Can't comment on the merits of block sampling and your implementation
thereof.  Just some nitpicking:

|!  * Row Sampling: As of May 2004, we use the Vitter algorithm to create

Linking the use of the Vitter algorithm to May 2004 is not quite
appropriate.  We introduced two stage sampling at that time.

|   * a random sample of targrows rows (or less, if there are less in the
|!  * sample of blocks). In this case, targblocks is always the same as
|!  * targrows, so we always read one row per block.

This is just wrong, unless you add "on average".  Even then it is a
bit misleading, because in most cases we *read* more tuples than we
use.

|   * Although every row has an equal chance of ending up in the final
|   * sample, this sampling method is not perfect: not every possible
|   * sample has an equal chance of being selected.  For large relations
|   * the number of different blocks represented by the sample tends to be
|!  * too small.  In that case, block sampling should be used.

Is the last sentence a fact or personal opinion?

|   * block.  The previous sampling method put too much credence in the row
|   * density near the start of the table.

FYI, "previous" refers to the date mentioned above:
previous == before May 2004 == before two stage sampling.

|+         /* Assume that we will have rows at least 64 bytes wide 
|+          * Currently we're very unlikely to overflow availMem here 
|+          */
|+         if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))

This is a funny way of saying
	if (allocrows * (sizeof(HeapTuple) + 60) > allowedMem)

It doesn't match the comment above; and it is something different if
the size of a pointer is not four bytes.

|-     if (bs.m > 0)
|+     if (bs.m > 0 )

Oops.

|+         ereport(DEBUG2,
|+             (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u", 
|+                         stats->tupattnum,samplerows, nmultiple, f1, d)));
^                          missing space here and in some more places
 

I haven't been following the discussion too closely but didn't you say
that a block sampling algorithm somehow compensates for the fact that
the sample is not random?
Servus
Manfred


Re: Improving N-Distinct estimation by ANALYZE

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Tom has not spoken against checking for UNIQUE constraints: he is just
> pointing out that there never could be a constraint in the case I was
> identifying.

More generally, arguing for or against any system-wide change on the
basis of performance of compute_minimal_stats() is folly, because that
function is not used for any common datatypes.  (Ideally it'd never be
used at all.)
        regards, tom lane


Re: Improving N-Distinct estimation by ANALYZE

From
Simon Riggs
Date:
On Mon, 2006-01-16 at 21:24 +0100, Manfred Koizar wrote:
> On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs
> <simon@2ndquadrant.com> wrote:
> >I enclose a patch for checking out block sampling.
> 
> Can't comment on the merits of block sampling and your implementation
> thereof.  

But your thoughts are welcome...

> |!  * Row Sampling: As of May 2004, we use the Vitter algorithm to create
> 
> Linking the use of the Vitter algorithm to May 2004 is not quite
> appropriate.  We introduced two stage sampling at that time.

I'll redo some of the comments as you suggest and review coding points.

> |   * a random sample of targrows rows (or less, if there are less in the
> |!  * sample of blocks). In this case, targblocks is always the same as
> |!  * targrows, so we always read one row per block.
> 
> This is just wrong, unless you add "on average".  Even then it is a
> bit misleading, because in most cases we *read* more tuples than we
> use.

The current code reads a number of blocks == targrows, hence "one per
block" but I'll change the comment.

> |   * Although every row has an equal chance of ending up in the final
> |   * sample, this sampling method is not perfect: not every possible
> |   * sample has an equal chance of being selected.  For large relations
> |   * the number of different blocks represented by the sample tends to be
> |!  * too small.  In that case, block sampling should be used.
> 
> Is the last sentence a fact or personal opinion?

That is my contention, based upon research. The existing code clearly
identifies that case as requiring a solution. We use Chaudhuri et al's
1998 result for the sufficiency of sample size, yet overlook their
estimators and their ideas for random block selection.

My first point is that we need proportional sample size, not fixed size.
[I show that for large tables we currently use a sample that is 2-5
times too small, even based on Chaudhuri's work.]

ANALYZE is very I/O bound: random row sampling would need to scan the
whole table to take a 2-3% sample in most cases. Block sampling is a
realistic way of getting a larger sample size without loss of
performance, and also a way of getting a reasonable sample when
accessing very few blocks.

We would keep random row sampling for smaller relations where the
performance effects of sample collection are not noticeable.

> I haven't been following the discussion too closely but didn't you say
> that a block sampling algorithm somehow compensates for the fact that
> the sample is not random?

I haven't said that block sampling compensates for the sample being
non-random. There is a danger of bias in the sample and I made that
clear. I've tried to mitigate that by the use of random blocks, reusing
your code and taking note of the earlier problems you solved. However,
block sampling allows us to spot cases where rows are naturally grouped
together, which row sampling doesn't spot until very high sample sizes.
I explored in detail a common case where this causes the current method
to fall down very badly. 

Brutlag & Richardson [2000] give a good run down on block sampling and
I'm looking to implement some of his block estimators to complement the
current patch. (Credit to Andrew Dunstan for locating the paper...)
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps

[I did briefly explore the possibility of hybrid row/block sampling,
using row sampling as well as some block sampling to identify grouping,
with an attempt to avoid swamping the sample with biased rows; but I
don't have a strong basis for that, so I've ignored that for now.]

This patch allows us to explore the possible bias that block sampling
gives, allowing us to make a later decision to include it, or not.

Best Regards, Simon Riggs