Thread: benchmarking the query planner (was Re: Simple postgresql.conf wizard)
Sorry for top posting but we are getting a bit far afield from the original topic. I followed up on the tests I did last night:

http://archives.postgresql.org/pgsql-hackers/2008-12/msg00369.php

I benchmarked 100 iterations of EXPLAIN on the query Greg Stark put together as a synthetic benchmark for default_statistics_target with various values for "SET STATISTICS n". Testing was done on CVS HEAD on my laptop with no configure options other than --prefix. Then I did this, to disable compression on pg_statistic:

alter table pg_statistic alter column stanumbers1 set storage external;
alter table pg_statistic alter column stanumbers2 set storage external;
alter table pg_statistic alter column stanumbers3 set storage external;
alter table pg_statistic alter column stanumbers4 set storage external;
alter table pg_statistic alter column stavalues1 set storage external;
alter table pg_statistic alter column stavalues2 set storage external;
alter table pg_statistic alter column stavalues3 set storage external;
alter table pg_statistic alter column stavalues4 set storage external;

(Note that you'll need to put allow_system_table_mods=true in your postgresql.conf file if you want this to work.) Then I reran the tests. The results were pretty dramatic. In the table below, the first column is the value of "SET STATISTICS n" that was applied to the table column prior to analyzing it. The second column is the time required to plan the query 100x AFTER disabling compression on pg_statistic, and the third column is the time required to plan the query 100x BEFORE disabling compression on pg_statistic.

10 0.829202 0.8249
20 1.059976 1.06957
30 1.168727 1.143803
40 1.287189 1.263252
50 1.370167 1.363951
60 1.486589 1.460464
70 1.603899 1.571107
80 1.69402 1.689651
90 1.79068 1.804454
100 1.930877 2.803941
150 2.446471 4.833002
200 2.95323 6.217708
250 3.436741 7.507919
300 3.983568 8.895015
350 4.497475 10.201713
400 5.072471 11.576961
450 5.615272 12.933128
500 6.286358 14.408157
550 6.895951 15.745378
600 7.400134 17.192916
650 8.038159 18.568616
700 8.606704 20.025952
750 9.154889 21.45775
800 9.80953 22.74635
850 10.363471 24.057379
900 11.022348 25.559911
950 11.69732 27.021034
1000 12.266699 28.711027

As you can see, for default_statistics_target > 90, this is a HUGE win.

After doing this test, I rebuilt with --enable-profiling and profiled EXPLAIN 10x with SET STATISTICS 10, 70, 100, 200 with a vanilla configuration, and then 200 again with compression turned off as described above. The, ahem, ridiculously small limit on attachment size prevents me from attaching the full results, so please see the attached results, which are truncated after the first section. 10x doesn't seem to be quite enough to get the exact picture of where the bottlenecks are, but the overall picture is clear enough: decompression introduces a huge overhead.

Looking specifically at the 200-decompress output, the next biggest hit is AllocSetAlloc(), which, from the detailed results that I unfortunately can't include, is being called mostly by datumCopy(), which is being called mostly by get_attstatsslot(). There are 4000 calls to get_attstatsslot() which result in 701,500 calls to datumCopy().

I'm not too sure what any of this means in terms of optimization, other than that changing the storage type of pg_statistic columns to external looks like a huge win. Perhaps someone more knowledgeable than I has some thoughts.

...Robert
Attachment: gmon-summary.tbz
That might only be the case when the pg_statistic record is in shared buffers.

Also I wonder if eqjoinsel and company might need to be made more toast-aware by detoasting all the things it needs once rather than every time it accesses them.

greg

On 6 Dec 2008, at 06:19 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

> Sorry for top posting but we are getting a bit far afield from the
> original topic. I followed up on the tests I did last night:
>
> http://archives.postgresql.org/pgsql-hackers/2008-12/msg00369.php
>
> [...]
Greg Stark <greg.stark@enterprisedb.com> writes:
> That might only be the case when the pg_statistic record is in shared
> buffers.

Yeah, it seems unlikely that disabling compression is a good idea in the bigger scheme of things.

> Also I wonder if eqjoinsel and company might need to be made more
> toast-aware by detoasting all the things it needs once rather than
> every time it accesses them.

At least in this example, the same stats arrays are fetched multiple times per query. It might be worth teaching the planner to cache those results internally during a plan cycle --- this would avoid most of the complexity associated with caching (ie, figuring out when to clear the cache) and still get maybe a 2x or 3x win.

regards, tom lane
On Mon, Dec 8, 2008 at 11:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <greg.stark@enterprisedb.com> writes:
>> That might only be the case when the pg_statistic record is in shared
>> buffers.
>
> Yeah, it seems unlikely that disabling compression is a good idea in the
> bigger scheme of things.

Is there any way to figure out what sort of compression ratios we're typically getting? The only way compression can really be a win is if it's saving us having to read in additional pages. Even then, I'm not sure we should worry about needing to read in additional pages in this case. I would sure expect statistics to stay in shared buffers, or at least in the operating system cache. Sure, if the table is accessed VERY infrequently, that might not be the case, but is that really what we want to optimize for?

...Robert
On Mon, Dec 8, 2008 at 10:24 AM, Gregory Stark <stark@enterprisedb.com> wrote:
> I tried a different query, trying to get quadratic growth and again failed.

The profiling results I sent the other day show an exactly-linear increase in the number of times eqjoinsel invokes FunctionCall2. Reading through the eqjoinsel_inner loop in selfuncs.c beginning around line 2042, I think what is happening is this: since the two tables are really the same table, nvalues1 and nvalues2 are the same array, and therefore contain the same elements in the same order. As a result, for each i, we skip over the first i - 1 entries in nvalues2, which have already been matched, and then compare element i of nvalues1 to element i of nvalues2 and mark them both as matched. Although this is technically an O(n^2) algorithm, the constant is very low, because this code is Really Fast:

if (hasmatch[j])
    continue;

To get observable quadratic behavior, I think you might need to construct two MCV arrays where all the values are the same, but the arrays are not in the same order. I believe they are sorted by frequency, so it might be sufficient to arrange things so that the N'th most common value in one table is the (statistics_target + 1 - N)'th most common value in the other. It might also help to use a datatype with a really slow comparison function (perhaps one intentionally constructed to be slow).

...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: > On Mon, Dec 8, 2008 at 10:24 AM, Gregory Stark <stark@enterprisedb.com> wrote: >> I tried a different query, trying to get quadratic growth and again failed. It > The profiling results I sent the other day show an exactly-linear > increase in the number of times eqjoinsel invokes FunctionCall2. > Reading through the the eqjoinsel_inner loop in selfuncs.c beginning > around line 2042, I think what is happening is this: since the two > tables are really the same table, nvalues1 and nvalues2 are the same > array, and therefore contain the same elements in the same order. Yeah, that would be fast. To see a quadratic case you need MCV arrays that have little or no overlap of common values --- then each element of the first will be compared (in vain) to all or most of the elements in the second. regards, tom lane
> Yeah, that would be fast. To see a quadratic case you need MCV arrays
> that have little or no overlap of common values --- then each element of
> the first will be compared (in vain) to all or most of the elements in
> the second.

Ah, that makes sense. Here's a test case based on Greg's. This is definitely more than linear once you get above about n = 80, but it's not quadratic either. n = 1000 is only 43x n = 80, and while that's surely more than 1000/80 = 12.5, it's also a lot less than (1000/80)^2 = 156.25.

create table tk as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk (select * from tk);
insert into tk (select * from tk);
insert into tk (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk2 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk2 (select * from tk2);
insert into tk2 (select * from tk2);
insert into tk2 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk3 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk3 (select * from tk3);
insert into tk3 (select * from tk3);
insert into tk3 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk4 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk4 (select * from tk4);
insert into tk4 (select * from tk4);
insert into tk4 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk5 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk5 (select * from tk5);
insert into tk5 (select * from tk5);
insert into tk5 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

create table tk6 as select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,1000);
insert into tk6 (select * from tk6);
insert into tk6 (select * from tk6);
insert into tk6 (select random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r from generate_series(1,2000));

and then (after disabling autovacuum):

set default_statistics_target = XXX;
analyze;

repeat 100x:

explain select count(*) from (select * from tk as k, tk2 as l, tk3 as m, tk4 as n, tk5 as o, tk6 as p where k.r=l.r and k.r=m.r and k.r=n.r and k.r=o.r and k.r=p.r) as x;

Timings (for 100 iterations):

10 0.900309
20 1.189229
30 1.280892
40 1.447358
50 1.611779
60 1.795701
70 2.001245
80 2.286144
90 2.955732
100 3.925557
150 6.472436
200 9.010824
250 11.89753
300 15.109172
350 18.813514
400 22.901383
450 27.842019
500 32.02136
550 37.609196
600 42.894322
650 48.460327
700 55.169819
750 61.568125
800 68.222201
850 75.027591
900 82.918344
950 91.235267
1000 99.737802

...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: > Ah, that makes sense. Here's a test case based on Greg's. This is > definitely more than linear once you get above about n = 80, but it's > not quadratic either. n = 1000 is only 43x n = 80, and while that's > surely more than 1000/80 = 12.5, it's also a lot less than (1000/80)^2 > = 156.25. Yeah, that's not too surprising. There's a linear cost associated with fetching/deconstructing the stats array, and a quadratic cost in the comparison loop. What these results say is that the upper limit of 1000 keeps you from getting into the range where the linear component is completely negligible. If you plot the results and try to fit a curve like c1*x^2 + c2*x to them, you get a very good fit for all the points above about 80. Below that, the curve has a flatter slope, indicating a smaller linear cost component. The values in this test are about 100 bytes each, so the breakpoint at 80 seems to correspond to what happens when the stats array no longer fits in one page. I replicated your test and got timings like these, using CVS HEAD with cassert off: 10 1.587 20 1.997 30 2.208 40 2.499 50 2.734 60 3.048 70 3.368 80 3.706 90 4.686 100 6.418 150 10.016 200 13.598 250 17.905 300 22.777 400 33.471 500 46.394 600 61.185 700 77.664 800 96.304 900 116.615 1000 140.117 So this machine is a bit slower than yours, but otherwise it's pretty consistent with your numbers. I then modified the test to use array[random()::text,random()::text,random()::text,random()::text,random()::text,random()::text] ie, the same data except stuffed into an array. I did this because I know that array_eq is pretty durn expensive, and indeed: 10 1.662 20 2.478 30 3.119 40 3.885 50 4.636 60 5.437 70 6.403 80 7.427 90 8.473 100 9.597 150 16.542 200 24.919 250 35.225 300 47.423 400 76.509 500 114.076 600 157.535 700 211.189 800 269.731 900 335.427 1000 409.638 When looking at these numbers one might think the threshold of pain is about 50, rather than 100 which is where I'd put it for the text example. However, this is probably an extreme worst case. On the whole I think we have some evidence here to say that upping the default value of default_stats_target to 100 wouldn't be out of line, but 1000 definitely is. Comments? BTW, does anyone have an opinion about changing the upper limit for default_stats_target to, say, 10000? These tests suggest that you wouldn't want such a value for a column used as a join key, but I can see a possible argument for high values in text search and similar applications. regards, tom lane
> BTW, does anyone have an opinion about changing the upper limit for
> default_stats_target to, say, 10000? These tests suggest that you
> wouldn't want such a value for a column used as a join key, but
> I can see a possible argument for high values in text search and
> similar applications.
Do you consider using hash tables?
I am not sure a hash is a perfect match here; however, I guess some kind of data structure might improve the N^2 behaviour. It looks like that would improve both array_eq (a hash would narrow the list of candidate arrays to a single bucket) and large statistics targets (I guess that would improve N^2 to N).
Regards,
Vladimir Sitnikov
"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes: > Do you consider using hash tables? Doubt it's really worth it, unless there's some way to amortize the setup cost across multiple selectivity estimations; which would surely complicate life. One thing that just now occurred to me is that as long as we maintain the convention that MCV lists are in decreasing frequency order, one can take any prefix of the list and it's a perfectly good MCV list of less resolution. So one way to reduce the time taken in eqjoinsel is to set an upper limit on the number of entries considered *by that routine*, whereas other estimator functions could use larger lists. regards, tom lane
> When looking at these numbers one might think the threshold of pain
> is about 50, rather than 100 which is where I'd put it for the text
> example. However, this is probably an extreme worst case.
>
> On the whole I think we have some evidence here to say that upping the
> default value of default_stats_target to 100 wouldn't be out of line,
> but 1000 definitely is. Comments?

Do you think there's any value in making it scale based on the size of the table? How hard would it be? If you made it MIN(10 + 0.001 * estimated_rows, 100), you would probably get most of the benefit while avoiding unnecessary overhead for small tables.

Otherwise, I am a bit concerned that 10 -> 100 may be too big a jump for one release, especially since it may cause the statistics to get toasted in some cases, which comes with a significant performance hit. I would raise it to 30 or 50 and plan to consider raising it further down the road. (I realize I just made about a million enemies with that suggestion.)

> BTW, does anyone have an opinion about changing the upper limit for
> default_stats_target to, say, 10000? These tests suggest that you
> wouldn't want such a value for a column used as a join key, but
> I can see a possible argument for high values in text search and
> similar applications.

I think that's a good idea. Given that most people probably don't bother fiddling with this parameter at all, it doesn't strike me as much of a foot-gun. I think you'd need a heck of a big table to justify a value in that range, but some people may have them.

...Robert
>> Do you consider using hash tables?
>
> Doubt it's really worth it, unless there's some way to amortize the
> setup cost across multiple selectivity estimations; which would surely
> complicate life.
MCV lists are updated only during the "analyze" phase, aren't they? If the "setup cost" is the cost of maintaining those "hash tables", it is not going to hurt much.
> One thing that just now occurred to me is that as long as we maintain
> the convention that MCV lists are in decreasing frequency order, one can
> take any prefix of the list and it's a perfectly good MCV list of less
> resolution. So one way to reduce the time taken in eqjoinsel is to set
> an upper limit on the number of entries considered *by that routine*,
> whereas other estimator functions could use larger lists.
That makes sense; however, a linear search for a single item in a list of 10,000 elements could take a while. A hash lookup might be a better choice.
Regards,
Vladimir Sitnikov
On Thu, Dec 11, 2008 at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes:
>> Do you consider using hash tables?
>
> Doubt it's really worth it, unless there's some way to amortize the
> setup cost across multiple selectivity estimations; which would surely
> complicate life.
>
> One thing that just now occurred to me is that as long as we maintain
> the convention that MCV lists are in decreasing frequency order, one can
> take any prefix of the list and it's a perfectly good MCV list of less
> resolution. So one way to reduce the time taken in eqjoinsel is to set
> an upper limit on the number of entries considered *by that routine*,
> whereas other estimator functions could use larger lists.

To what extent will that negate the benefit of having those statistics in the first place?

Here's another idea. If you have a < operator, you could use a quicksort-type strategy to partition the search space. Pick an arbitrary element of either list and apply it to all elements of both lists to divide the initial problem into two problems that are each half as large. When the subproblems fall below some size threshold, then solve them according to the existing algorithm. This is O(n^2) in the worst case, just like quicksort, but the worst case is difficult to construct.

...Robert
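For illustration only, here is a self-contained toy version of that partitioning strategy, with integers standing in for MCV datums and a counter standing in for operator-function calls; it is a sketch of the idea, not proposed planner code. On two disjoint 1000-entry lists (the worst case for the existing nested loop) the brute-force matcher performs 1,000,000 comparisons, while the partitioned version needs only a small fraction of that:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static long comparisons = 0;    /* stands in for '=' / '<' operator calls */

/* Brute-force matcher, analogous to the existing nested-loop algorithm. */
static int
match_bruteforce(int *a, int na, int *b, int nb)
{
    bool   *hasmatch = calloc(nb, sizeof(bool));
    int     nmatches = 0;

    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
        {
            if (hasmatch[j])
                continue;
            comparisons++;
            if (a[i] == b[j])
            {
                hasmatch[j] = true;
                nmatches++;
                break;
            }
        }
    free(hasmatch);
    return nmatches;
}

/* Move elements less than pivot to the front of x[]; return their count. */
static int
partition(int *x, int n, int pivot)
{
    int     k = 0;

    for (int i = 0; i < n; i++)
    {
        comparisons++;          /* one '<' operator call per element */
        if (x[i] < pivot)
        {
            int     tmp = x[i];
            x[i] = x[k];
            x[k] = tmp;
            k++;
        }
    }
    return k;
}

/* Quicksort-style matcher: split both lists around one pivot and recurse. */
static int
match_partitioned(int *a, int na, int *b, int nb)
{
    int     pivot, ka, kb;

    if (na == 0 || nb == 0)
        return 0;
    if (na <= 8 || nb <= 8)     /* small subproblem: use the existing method */
        return match_bruteforce(a, na, b, nb);

    pivot = a[na / 2];          /* arbitrary element of one list */
    ka = partition(a, na, pivot);
    kb = partition(b, nb, pivot);
    if (ka == 0)                /* degenerate split: fall back */
        return match_bruteforce(a, na, b, nb);

    /* values < pivot can only match values < pivot, and likewise for >= */
    return match_partitioned(a, ka, b, kb) +
           match_partitioned(a + ka, na - ka, b + kb, nb - kb);
}

int
main(void)
{
    enum {N = 1000};
    static int a[N], b[N];

    for (int i = 0; i < N; i++)
    {
        a[i] = 2 * i;           /* evens */
        b[i] = 2 * i + 1;       /* odds: no overlap with a[] at all */
    }

    comparisons = 0;
    printf("brute force: %d matches, %ld comparisons\n",
           match_bruteforce(a, N, b, N), comparisons);
    comparisons = 0;
    printf("partitioned: %d matches, %ld comparisons\n",
           match_partitioned(a, N, b, N), comparisons);
    return 0;
}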
"Robert Haas" <robertmhaas@gmail.com> writes: >> On the whole I think we have some evidence here to say that upping the >> default value of default_stats_target to 100 wouldn't be out of line, >> but 1000 definitely is. Comments? > Do you think there's any value in making it scale based on the size of > the table? As far as the MCVs go, I think we already have a decent heuristic for determining the actual length of the array, based on discarding values that have too small an estimated frequency --- look into compute_scalar_stats. I don't see that explicitly considering table size would improve that. It might be worth limiting the size of the histogram, as opposed to the MCV list, for smaller tables. But that's not relevant to the speed of eqjoinsel, and AFAIK there aren't any estimators that are O(N^2) in the histogram length. (ineq_histogram_selectivity is actually O(log N), so it hardly cares at all.) > Otherwise, I am a bit concerned that 10 -> 100 may be too big a jump > for one release, especially since it may cause the statistics to get > toasted in some cases, which comes with a significant performance hit. > I would raise it to 30 or 50 and plan to consider raising it further > down the road. (I realize I just made about a million enemies with > that suggestion.) There's something in what you say, but consider that we have pretty much unanimous agreement that 10 is too small. I think we should try to fix the problem, not just gradually ratchet up the value until people start complaining in the other direction. (Also, we should have plenty of opportunity during beta to find out if we went too far.) regards, tom lane
> There's something in what you say, but consider that we have pretty
> much unanimous agreement that 10 is too small. I think we should
> try to fix the problem, not just gradually ratchet up the value until
> people start complaining in the other direction. (Also, we should have
> plenty of opportunity during beta to find out if we went too far.)
I am not sure whether the entity-attribute-value model gets used with Postgres databases, but that is one of the cases that requires a large MCV list (generally, for the attribute column).

You know, Oracle is not able to store more than 254 distinct values for histogram statistics. That really limits the use of histograms for the software product the company I work for creates.
One more direction could be implementing "MCV" for range of values (group values and interpolate in between). Consider statistics on timestamp column that says that for "2008-December" there are as many X rows, for "2008-November" as many as Y, etc. That could be used for rather accurate cardinality estimation of "between" cases, while keeping number of entries in "MCV" list small.
Regards,
Vladimir Sitnikov
"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes: > One more direction could be implementing "MCV" for range of values (group > values and interpolate in between). Consider statistics on timestamp column > that says that for "2008-December" there are as many X rows, for > "2008-November" as many as Y, etc. That could be used for rather accurate > cardinality estimation of "between" cases, while keeping number of entries > in "MCV" list small. I think that would likely be redundant with the histogram. regards, tom lane
> One more direction could be implementing "MCV" for range of values (group
> values and interpolate in between). Consider statistics on timestamp column
> that says that for "2008-December" there are as many X rows, for
> "2008-November" as many as Y, etc. That could be used for rather accurate
> cardinality estimation of "between" cases, while keeping number of entries
> in "MCV" list small.

I suggested this earlier ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ). If there's interest, I'd be happy to clean up my patch and submit it for 8.5.

-Nathan
> I think that would likely be redundant with the histogram.
I have no idea how to decide on an automatic switch between histograms and MCVs. It might sound crazy, but one could compute both histograms and MCVs and use them both (and pick an average of the two estimations :) )
Regards,
Vladimir Sitnikov
On Thu, 2008-12-11 at 13:09 -0500, Tom Lane wrote:
> On the whole I think we have some evidence here to say that upping the
> default value of default_stats_target to 100 wouldn't be out of line,
> but 1000 definitely is. Comments?

Sounds good to me.

I would like it even more if there was a data type specific default. Currently we have a special case for boolean, but that's it. TPC allows for different defaults for different types (but that's not my only reason). That would allow us to set it lower for long text strings and floats and potentially higher for things like smallint, which is less likely as a join target.

Would be great if we could set the default_stats_target for all columns in a table with a single ALTER TABLE statement, rather than setting it for every column individually.

And I would like it even more if the sample size increased according to table size, since that makes ndistinct values fairly random for large tables.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes: >> I think that would likely be redundant with the histogram. >> > I am afraid I do not get you. AFAICS what you're proposing *is* a histogram, just awkwardly described. What is the specific difference between what you are talking about and what scalarineqsel already implements? regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2008-12-11 at 13:09 -0500, Tom Lane wrote:
>> On the whole I think we have some evidence here to say that upping the
>> default value of default_stats_target to 100 wouldn't be out of line,
>> but 1000 definitely is. Comments?
>
> Sounds good to me.
>
> I would like it even more if there was a data type specific default.
> Currently we have a special case for boolean, but that's it. TPC allows
> for different defaults for different types (but that's not my only
> reason). That would allow us to set it lower for long text strings and
> floats and potentially higher for things like smallint, which is less
> likely as a join target.

In conjunction with a toast rethink it would be interesting to engineer things so that we keep one page's worth of data after toasting for each of the arrays. That would at least remove a lot of the dangers of larger arrays.

> Would be great if we could set the default_stats_target for all columns
> in a table with a single ALTER TABLE statement, rather than setting it
> for every column individually.

Hm, just because one column has a skewed data distribution doesn't tell you much about what the other columns' data distributions are. On the other hand I could see an argument that a given table might be consistently used in only large batch queries or quick OLTP queries.

> And I would like it even more if the sample size increased according to
> table size, since that makes ndistinct values fairly random for large
> tables.

Unfortunately _any_ ndistinct estimate based on a sample of the table is going to be pretty random.

-- 
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark <stark@enterprisedb.com> writes:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> And I would like it even more if the sample size increased according to
>> table size, since that makes ndistinct values fairly random for large
>> tables.
>
> Unfortunately _any_ ndistinct estimate based on a sample of the table is
> going to be pretty random.

Yeah, it's a hard problem. It's worth noting though that increasing the default stats target 10X is already going to result in a 10X increase in the default sample size, so we should see at least some improvement in the typical quality of ndistinct estimates.

regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> I would like it even more if there was a data type specific default.
> Currently we have a special case for boolean, but that's it.

No, we don't (or if we do I'd be interested to know where). I don't see much value in a data-type-dependent default anyway --- would you make different defaults for int4, int8, and float8, and on what grounds? The actual data contents of three such columns could easily be exactly equivalent.

regards, tom lane
> What is the specific difference between what you are talking about and
> what scalarineqsel already implements?
Hmm... Nothing new. Sorry for bothering you. I did not realize histograms were already implemented.
Regards,
Vladimir Sitnikov
Tom Lane <tgl@sss.pgh.pa.us> writes:
> 500 114.076
> 600 157.535
> 700 211.189
> 800 269.731
> 900 335.427
> 1000 409.638
> ...
> BTW, does anyone have an opinion about changing the upper limit for
> default_stats_target to, say, 10000? These tests suggest that you
> wouldn't want such a value for a column used as a join key, but
> I can see a possible argument for high values in text search and
> similar applications.

I don't like the existing arbitrary limit which it sounds like people are really bumping into. But that curve looks like it might be getting awfully steep. I wonder just how long 10,000 would take?

-- 
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> BTW, does anyone have an opinion about changing the upper limit for
>> default_stats_target to, say, 10000? These tests suggest that you
>> wouldn't want such a value for a column used as a join key, but
>> I can see a possible argument for high values in text search and
>> similar applications.
>
> I don't like the existing arbitrary limit which it sounds like people are
> really bumping into. But that curve looks like it might be getting awfully
> steep. I wonder just how long 10,000 would take?

Presumably, right about 100X longer than 1000 ... if we don't do anything about limiting the number of values eqjoinsel looks at.

I think though that the case for doing so is pretty good. "MCVs" that are beyond the K'th entry can't possibly have frequencies greater than 1/K, and in most cases it'll be a lot less. So the incremental contribution to the accuracy of the join selectivity estimate drops off pretty quickly, I should think. And it's not like we're ignoring the existence of those values entirely --- we'd just be treating them as if they are part of the undifferentiated collection of non-MCV values.

It might be best to stop when the frequency drops below some threshold, rather than taking a fixed number of entries.

regards, tom lane
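Because the stats arrays are already stored in decreasing frequency order, applying such a cutoff is cheap. A minimal sketch (hypothetical helper, not existing PostgreSQL code) of what eqjoinsel could call before entering its nested loops:

#include <stdio.h>

/*
 * Hypothetical helper: given MCV frequencies sorted in decreasing order
 * (as the stats arrays are), return how many leading entries have a
 * frequency of at least min_freq.
 */
static int
clamp_mcvs(const double *freqs, int nvalues, double min_freq)
{
    int     n = 0;

    while (n < nvalues && freqs[n] >= min_freq)
        n++;
    return n;
}

int
main(void)
{
    /* made-up decreasing frequencies, just to exercise the helper */
    double  freqs[] = {0.20, 0.10, 0.05, 0.002, 0.0008, 0.0001};

    printf("%d entries kept at threshold 0.001\n",
           clamp_mcvs(freqs, 6, 0.001));    /* prints 4 */
    return 0;
}

The join-selectivity code would then run its loops over only that many entries of each list and treat the remainder as part of the undifferentiated non-MCV population, as described above.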
On Thu, Dec 11, 2008 at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> I would like it even more if there was a data type specific default.
>> Currently we have a special case for boolean, but that's it.
>
> No, we don't (or if we do I'd be interested to know where).

utils/adt/selfuncs.c, get_variable_numdistinct

...Robert
On Thu, 2008-12-11 at 17:45 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> I would like it even more if there was a data type specific default.
>> Currently we have a special case for boolean, but that's it.
>
> No, we don't (or if we do I'd be interested to know where).

Your commit, selfuncs.c, 7 Jul.

> I don't see much value in a data-type-dependent default anyway --- would
> you make different defaults for int4, int8, and float8, and on what
> grounds? The actual data contents of three such columns could easily be
> exactly equivalent.

I would prefer distinct type or domain specific defaults, but until we have that I would settle for datatype specific defaults. Defaults, not only permissible settings.

Most people don't keep the same data in float8 as they do in int4, but neither of those were the ones I was thinking about. I see 3 main classes:

* data with a small number of distinct values (e.g. boolean, smallint)
* data with many distinct values
* data where every value is typically unique (e.g. text)

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
"Robert Haas" <robertmhaas@gmail.com> writes: > On Thu, Dec 11, 2008 at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >>> I would like it even more if there was a data type specific default. >>> Currently we have a special case for boolean, but that's it. >> >> No, we don't (or if we do I'd be interested to know where). > utils/adt/selfuncs.c, get_variable_numdistinct That's got nothing to do with stats collection; it's a fallback case used if no stats are available. regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2008-12-11 at 17:45 -0500, Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> I would like it even more if there was a data type specific default.
>>> Currently we have a special case for boolean, but that's it.
>>
>> No, we don't (or if we do I'd be interested to know where).

> Your commit, selfuncs.c, 7 Jul.

As with Robert's pointer, that's about coping with missing stats, not about determining what stats to collect.

> ... neither of those were the ones I was thinking about. I see 3 main classes:
> * data with a small number of distinct values (e.g. boolean, smallint)
> * data with many distinct values
> * data where every value is typically unique (e.g. text)

These three categories are already dealt with in an entirely type-independent fashion by the heuristics in compute_scalar_stats. I think it's quite appropriate to drive them off the number of observed values, not guesses about what a particular datatype is used for.

regards, tom lane
On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
>> And I would like it even more if the sample size increased according to
>> table size, since that makes ndistinct values fairly random for large
>> tables.
>
> Unfortunately _any_ ndistinct estimate based on a sample of the table
> is going to be pretty random.

We know that constructed data distributions can destroy the effectiveness of the ndistinct estimate and make sample size irrelevant. But typical real world data distributions do improve their estimations with increased sample size and so it is worthwhile.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
>>> Simon Riggs <simon@2ndQuadrant.com> wrote:
> On Thu, 2008-12-11 at 17:45 -0500, Tom Lane wrote:
>> I don't see much value in a data-type-dependent default anyway

I am dubious about the value there, too, at least for our environment.

> I would prefer distinct type or domain specific defaults

Now that would be really useful.

-Kevin
Tom Lane wrote:
> "Robert Haas" <robertmhaas@gmail.com> writes:
>>> On the whole I think we have some evidence here to say that upping the
>>> default value of default_stats_target to 100 wouldn't be out of line,
>>> but 1000 definitely is. Comments?
>
>> Do you think there's any value in making it scale based on the size of
>> the table?
>
> As far as the MCVs go, I think we already have a decent heuristic for
> determining the actual length of the array, based on discarding values
> that have too small an estimated frequency --- look into
> compute_scalar_stats. I don't see that explicitly considering table
> size would improve that. It might be worth limiting the size of the
> histogram, as opposed to the MCV list, for smaller tables. But that's
> not relevant to the speed of eqjoinsel, and AFAIK there aren't any
> estimators that are O(N^2) in the histogram length.
> (ineq_histogram_selectivity is actually O(log N), so it hardly cares at
> all.)

Why is selfuncs.c::var_eq_const() doing a linear scan over the MCV array instead of having the list sorted and doing a binary search on the array? We already do this for histogram lookups, as you mentioned. Does it not matter? It didn't for ten values but might for larger distinct lists.

>> Otherwise, I am a bit concerned that 10 -> 100 may be too big a jump
>> for one release, especially since it may cause the statistics to get
>> toasted in some cases, which comes with a significant performance hit.
>> I would raise it to 30 or 50 and plan to consider raising it further
>> down the road. (I realize I just made about a million enemies with
>> that suggestion.)
>
> There's something in what you say, but consider that we have pretty
> much unanimous agreement that 10 is too small. I think we should
> try to fix the problem, not just gradually ratchet up the value until
> people start complaining in the other direction. (Also, we should have
> plenty of opportunity during beta to find out if we went too far.)

I am excited we are addressing the low default statistics target value.

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
>>> And I would like it even more if the sample size increased according
>>> to table size, since that makes ndistinct values fairly random for
>>> large tables.
>>
>> Unfortunately _any_ ndistinct estimate based on a sample of the table
>> is going to be pretty random.

> We know that constructed data distributions can destroy the
> effectiveness of the ndistinct estimate and make sample size irrelevant.
> But typical real world data distributions do improve their estimations
> with increased sample size and so it is worthwhile.

This is handwaving unsupported by evidence. If you've got a specific proposal what to change the sample size to and some numbers about what it might gain us or cost us, I'm all ears.

regards, tom lane
Bruce Momjian <bruce@momjian.us> writes:
> Why is selfuncs.c::var_eq_const() doing a linear scan over the MCV array
> instead of having the list sorted and doing a binary search on the
> array?

1. Keeping the MCV array sorted by frequency allows cheap extraction of less-accurate MCV lists; you just take the first N entries.

2. Binary search presumes that you have a less-than operator, a concept not implicit in the notion of MCVs (but it *is* implicit in a histogram).

regards, tom lane
>> What is the specific difference between what you are talking about and
>> what scalarineqsel already implements?
>
> Hmm... Nothing new. Sorry for bothering you. I did not realize
> histograms were already implemented.

Well, ISTM there is a profound difference. For scalarineqsel we care about the total number of values in a bucket. For eqsel we care about the total number of *distinct* values in each bucket (which we don't track).

IMHO, the whole idea of increasing mcv's seems a mistake. Why not use the limited storage in pg_statistic to try and estimate the selectivity for ranges of values rather than a single value? That gives way better coverage of the distribution. If a value is too frequent to fit in a single bucket we put it in an mcv slot anyways. *That* should be the mechanism by which the number of mcv's increases.

I guess this is a bit off topic for the middle of a commit fest though.

-Nathan
"Nathan Boley" <npboley@gmail.com> writes: > Well, ISTM there is a profound difference. For scalarineqsel we care > about the total number of values in a bucket. For eqsel we care about > the total number of *distinct* values in each bucket Really? > IMHO, the whole idea of increasing mcv's seems a mistake. Why not use > the limited storage in pg_statistic to try and estimate the > selectivity for ranges of values rather than a single value? MCVs are useful for questions that are not related to ranges of values --- an example of something highly useful we do with them is to try to estimate the fraction of a column that satisfies a LIKE or regex pattern. In fact, as I was just pointing out to Bruce, we can compute them and do useful things with them for datatypes that don't have a defined sort order and so the whole concept of "range" is meaningless. Now, if your point is that it'd be more flexible to not tie the MCV list length to the histogram length, you're right. I'm not sure we can expect the average DBA to set two separate knobs very effectively, though. We do already have some smarts in there to try to set the MCV list length intelligently instead of slavishly following the stats target --- perhaps it wouldn't be a bad idea to develop similar heuristics for choosing an actual histogram length. regards, tom lane
Thanks for the response.

>> Well, ISTM there is a profound difference. For scalarineqsel we care
>> about the total number of values in a bucket. For eqsel we care about
>> the total number of *distinct* values in each bucket
>
> Really?

Well, we would obviously also care about the total number of values in the buckets if we were trying to use the histogram in eqsel. Isn't a selectivity estimate of x = v as ( the number of values in v's histogram bucket ) / ( number of distinct values in v's histogram bucket ) pretty rational? That's currently what we do for non-mcv values, except that we look at ndistinct over the whole table instead of individual histogram buckets.

>> IMHO, the whole idea of increasing mcv's seems a mistake. Why not use
>> the limited storage in pg_statistic to try and estimate the
>> selectivity for ranges of values rather than a single value?
>
> MCVs are useful for questions that are not related to ranges of values
> --- an example of something highly useful we do with them is to try to
> estimate the fraction of a column that satisfies a LIKE or regex
> pattern.

Good point. I guess I was responding to the eqsel benchmarks. I should remember that I don't necessarily know all the places that mcv's are used.

> In fact, as I was just pointing out to Bruce, we can compute them and
> do useful things with them for datatypes that don't have a defined sort
> order and so the whole concept of "range" is meaningless.

Another good point. But don't we have bigger stat problems for datatypes without an ordering relation? IIRC, analyze doesn't even fully compute the mcv list, as that would be N^2 in the sample size.

> Now, if your point is that it'd be more flexible to not tie the MCV list
> length to the histogram length, you're right.

No, my point is just the opposite. I think the two should be *more* tightly linked. It seems that ( at least for eqsel and scalarineqsel ) mcv's should be the values that the histogram can't deal with effectively. ie, there's no ordering relation, there are too many values to fit into a histogram bucket, the histogram eqsel estimate does an especially bad job for a relatively common value, etc.

Even now, if there are 100 histogram buckets then any values that occupy more than 1% of the table will be mcv's regardless - why force a value to be an mcv if it only occupies 0.1% of the table? That just bloats pg_statistic and slows down joins unnecessarily.

I'll submit a patch for 8.5 and then, hopefully, some simple benchmarks can make my point.

-Nathan
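Spelling the two estimates out, roughly (the actual var_eq_const logic has additional clamps and ignores nulls here): the current estimate for a non-MCV value v is something like

\mathrm{sel}(x = v) \approx \frac{1 - \sum_i f_{\mathrm{mcv},i}}{n_{\mathrm{distinct}} - n_{\mathrm{mcv}}}

while the per-bucket idea above would use, for the histogram bucket b containing v,

\mathrm{sel}(x = v) \approx \frac{\mathrm{rows}(b) / \mathrm{rows}_{\mathrm{total}}}{n_{\mathrm{distinct}}(b)}

i.e. the same form, but with the row fraction and distinct count taken per bucket rather than over the whole non-MCV population.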
"Nathan Boley" <npboley@gmail.com> writes: > Isn't a selectivity estimate of x = v as ( the number of values in v's > histogram bucket ) / ( number of distinct values in v's histogram > bucket ) pretty rational? Thats currently what we do for non-mcv > values, except that we look at ndistinct over the whole table instead > of individual histogram buckets. But the histogram buckets are (meant to be) equal-population, so it should come out the same either way. The removal of MCVs from the population will knock any possible variance in ndistinct down to the point where I seriously doubt that this could offer a win. An even bigger problem is that this requires estimation of ndistinct among fractions of the population, which will be proportionally less accurate than the overall estimate. Accurate ndistinct estimation is *hard*. > now, if there are 100 histogram buckets then any values that occupy > more than 1% of the table will be mcv's regardless - why force a value > to be an mcv if it only occupies 0.1% of the table? Didn't you just contradict yourself? The cutoff would be 1% not 0.1%. In any case there's already a heuristic to cut off the MCV list at some shorter length (ie, more than 1% in this example) if it seems not worthwhile to keep the last entries. See lines 2132ff (in CVS HEAD) in analyze.c. regards, tom lane
> I think though that the case for doing so is pretty good. "MCVs" that
> are beyond the K'th entry can't possibly have frequencies greater than
> 1/K, and in most cases it'll be a lot less. So the incremental
> contribution to the accuracy of the join selectivity estimate drops off
> pretty quickly, I should think. And it's not like we're ignoring the
> existence of those values entirely --- we'd just be treating them as if
> they are part of the undifferentiated collection of non-MCV values.
>
> It might be best to stop when the frequency drops below some threshold,
> rather than taking a fixed number of entries.

OK, I'll bite. How do we decide where to put the cutoff? If we make it too high, it will have a negative effect on join selectivity estimates; if it's too low, it won't really address the problem we're trying to fix. I randomly propose p = 0.001, which should limit eqjoinsel() to about a million equality tests in the worst case. In the synthetic example we were just benchmarking, that causes the entire MCV array to be tossed out the window (which feels about right).

...Robert
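The arithmetic behind that worst case: since the MCV frequencies on each side sum to at most 1, a cutoff of p leaves at most \lfloor 1/p \rfloor entries per list, so

\text{comparisons} \le \left(\frac{1}{p}\right)^{2} = 10^{6} \quad \text{at } p = 0.001.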
"Robert Haas" <robertmhaas@gmail.com> writes: >> It might be best to stop when the frequency drops below some threshold, >> rather than taking a fixed number of entries. > OK, I'll bite. How do we decide where to put the cutoff? If we make > it too high, it will have a negative effect on join selectivity > estimates; if it's too low, it won't really address the problem we're > trying to fix. I randomly propose p = 0.001, which should limit > eqjoinsel() to about a million equality tests in the worst case. In > the synthetic example we were just benchmarking, that causes the > entire MCV array to be tossed out the window (which feels about > right). Yeah. One idle thought I had was that maybe the cutoff needs to consider both probabilities: if the high-frequency MCVs on one side chance to match to not-so-high-frequency MCVs on the other, you would like to know about that. As long as we keep the lists in frequency order, it seems easy to implement this: for each MCV examined by the outer loop, you run the inner loop until the product of the outer and current inner frequency drops below whatever your threshold is. This doesn't immediately suggest what the threshold ought to be, but it seems like it ought to be possible to determine that given a desired maximum error in the overall estimate. I'm also not very clear on what the "total frequency" computations (matchfreq2 and unmatchfreq2 in the current code) ought to look like if we are using a variable subset of the inner list. regards, tom lane
On Thu, Dec 11, 2008 at 11:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
>>> And I would like it even more if the sample size increased according
>>> to table size, since that makes ndistinct values fairly random for
>>> large tables.
>>
>> Unfortunately _any_ ndistinct estimate based on a sample of the table
>> is going to be pretty random.
>
> We know that constructed data distributions can destroy the
> effectiveness of the ndistinct estimate and make sample size irrelevant.
> But typical real world data distributions do improve their estimations
> with increased sample size and so it is worthwhile.

Well that just means "more is always better" which puts us back in the same boat of always needing more.

The existing sampling mechanism is tied to solid statistics. It provides the correct sample size to get a consistent confidence range for range queries. This is the same mathematics which governs election polling and other surveys. The sample size you need to get +/- 5% 19 times out of 20 increases as the population increases, but not by very much.

However ndistinct is a different kind of question. Instead of needing the relatively small and slowly growing sample size, you need a percentage of the whole table. It would mean letting this one measure control the whole sample size decision. And the end result would be pretty unimpressive. The papers I read showed pretty poor results for sample sizes as large as 50% of the table.

Simply raising the statistics target for larger tables would not be very helpful. All it would do is raise the table size at which you would find the sample size too small. You do have to change to using a percentage of the whole table -- which would make ANALYZE a *lot* slower for larger tables.

Really the only way we'll get good ndistinct calculations is if we decide to scan the whole table. If we have to do that for some other reason then there are algorithms for gathering enough information to handle ndistinct properly.

-- 
greg
>> Isn't a selectivity estimate of x = v as ( the number of values in v's
>> histogram bucket ) / ( number of distinct values in v's histogram
>> bucket ) pretty rational? That's currently what we do for non-mcv
>> values, except that we look at ndistinct over the whole table instead
>> of individual histogram buckets.
>
> But the histogram buckets are (meant to be) equal-population, so it
> should come out the same either way.

Why is it the same either way? The histogram buckets have equal population in the number of total values, not the number of distinct values. Consider [0,0,0,0,1,2,3,4,5,6]. The histogram buckets would be [0,2) and [2,+oo), but they would have 2 and 5 distinct values respectively.

> The removal of MCVs from the
> population will knock any possible variance in ndistinct down to the
> point where I seriously doubt that this could offer a win.

Well, if the number of MCV's is large enough then it certainly won't matter. But isn't that pointlessly inefficient for most distributions? I provided some empirical evidence of this in an earlier post ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ) for normal distributions.

> An even
> bigger problem is that this requires estimation of ndistinct among
> fractions of the population, which will be proportionally less accurate
> than the overall estimate. Accurate ndistinct estimation is *hard*.

Agreed, but I'm not convinced that the overall error rate will go up for our current estimator. Ndistinct estimation is hardest for populations with a large ndistinct count variation. If the assumptions underlying this are founded ( namely that ndistinct distributions are related to the ordering relation in a predictable way ) then ndistinct estimation should be easier because the ndistinct distribution will be more uniform. But that's certainly something that should be tested on real data.

>> now, if there are 100 histogram buckets then any values that occupy
>> more than 1% of the table will be mcv's regardless - why force a value
>> to be an mcv if it only occupies 0.1% of the table?
>
> Didn't you just contradict yourself? The cutoff would be 1% not 0.1%.

No, I'm saying that if the number of histogram buckets is 100, then even if the mcv list is set to length 2, every value that occupies 1% or more of the table will be an mcv. However, if the size is fixed to 100 I think that it will be common to see values with relative frequencies much lower than 1% near the bottom of the list.

> In any case there's already a heuristic to cut off the MCV list at some
> shorter length (ie, more than 1% in this example) if it seems not
> worthwhile to keep the last entries. See lines 2132ff (in CVS HEAD)
> in analyze.c.

Given an increase in the default stats target, limits like the 25% are exactly what I think there should be more of.

Can anyone suggest a good data set to test this sort of question on?

-Nathan
>> OK, I'll bite. How do we decide where to put the cutoff? If we make
>> it too high, it will have a negative effect on join selectivity
>> estimates; if it's too low, it won't really address the problem we're
>> trying to fix. I randomly propose p = 0.001, which should limit
>> eqjoinsel() to about a million equality tests in the worst case. In
>> the synthetic example we were just benchmarking, that causes the
>> entire MCV array to be tossed out the window (which feels about
>> right).
>
> Yeah. One idle thought I had was that maybe the cutoff needs to
> consider both probabilities: if the high-frequency MCVs on one side
> chance to match to not-so-high-frequency MCVs on the other, you
> would like to know about that. As long as we keep the lists in
> frequency order, it seems easy to implement this: for each MCV
> examined by the outer loop, you run the inner loop until the product of
> the outer and current inner frequency drops below whatever your
> threshold is. This doesn't immediately suggest what the threshold

I had this idle thought too, but I didn't write it down because...

> ought to be, but it seems like it ought to be possible to determine
> that given a desired maximum error in the overall estimate. I'm also
> not very clear on what the "total frequency" computations (matchfreq2
> and unmatchfreq2 in the current code) ought to look like if we are using
> a variable subset of the inner list.

...of this exact concern, which I think is an insurmountable problem. If you don't consider some of the MCVs AT ALL, I think you can add their frequencies back in to otherfreq{1,2} and go home, but if you consider them for some elements of the other list but not all, I'm not sure there's an appropriate way to proceed.

...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: > I had this idle thought too, but I didn't write it down because... >> ought to be, but it seems like it ought to be possible to determine >> that given a desired maximum error in the overall estimate. I'm also >> not very clear on what the "total frequency" computations (matchfreq2 >> and unmatchfreq2 in the current code) ought to look like if we are using >> a variable subset of the inner list. > ...of this exact concern, which I think is an insurmountable problem. Maybe so. If we stick to the other design (end both lists at a preset frequency threshold) then the math clearly goes through the same as before, just with num_mcvs that are determined differently. But can we prove anything about the maximum error added from that? regards, tom lane
On Fri, 2008-12-12 at 02:23 +0000, Greg Stark wrote:
> The existing sampling mechanism is tied to solid statistics. It
> provides the correct sample size to get a consistent confidence range
> for range queries. This is the same mathematics which governs election
> polling and other surveys. The sample size you need to get +/- 5% 19
> times out of 20 increases as the population increases, but not by very
> much.

Sounds great, but it's not true. The sample size is not linked to data
volume, so how can it possibly give a consistent confidence range?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-12-11 at 18:52 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
> >>> And I would like it even more if the sample size increased according
> >>> to table size, since that makes ndistinct values fairly random for
> >>> large tables.
> >>
> >> Unfortunately _any_ ndistinct estimate based on a sample of the table
> >> is going to be pretty random.
>
> > We know that constructed data distributions can destroy the
> > effectiveness of the ndistinct estimate and make sample size irrelevant.
> > But typical real world data distributions do improve their estimations
> > with increased sample size and so it is worthwhile.
>
> This is handwaving unsupported by evidence.

Not at all. In the paper cited within the ANALYZE code, shown here:
ftp://ftp.research.microsoft.com/users/autoadmin/histogram_conf.pdf
we implement its recommended sample size for reliable histograms, but
ignore most of the rest of the paper. Section 4, on block-level
sampling, is ignored, yet the paper's conclusions (section 7.2) are that
it provides a larger and more effective sample without significantly
increasing the number of accessed disk blocks. Haas and Stokes [1998]
also indicate that the accuracy of ndistinct estimation is linked to
sample size.

In a previous post to hackers I looked at the case where values were
physically clustered together in the table, either naturally or via the
CLUSTER command. Section: ESTIMATES OF D FOR DEPENDENT TABLES
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00153.php

In that case the current ANALYZE algorithm fails badly because of the
fixed sample size. This is because "ANALYZE randomly samples rows, so
that the average gap between randomly selected rows increases as the
table size increases, because of the fixed sample size. Since the
clustered rows are typically close together, then the apparent number
of multiple instances of the same data value decreases as the sample
fraction decreases. Since the sample size is currently fixed, this means
that the D estimate decreases as the table size increases. (This is
proven in a test case below)."

> If you've got a specific
> proposal what to change the sample size to and some numbers about what
> it might gain us or cost us, I'm all ears.

So my specific proposal is: implement block-level sampling. It allows us
to
* increase sample size without increasing the number of I/Os
* account correctly for clustered data

I'm not trying to force this to happen now, I'm just bringing it into
the discussion because it's relevant and has not been mentioned.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> The existing sampling mechanism is tied to solid statistics. It >> provides the correct sample size to get a consistent confidence range >> for range queries. This is the same mathematics which governs election >> polling and other surveys. The sample size you need to get +/- 5% 19 >> times out of 20 increases as the population increases, but not by very >> much. > > Sounds great, but its not true. The sample size is not linked to data > volume, so how can it possibly give a consistent confidence range? I'm not 100% sure how relevant it is to this case, but I think what Greg is referring to is: http://en.wikipedia.org/wiki/Margin_of_error#Effect_of_population_size It is a pretty well-known mathematical fact that for something like an opinion poll your margin of error does not depend on the size of the population but only on the size of your sample. ...Robert
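As a back-of-the-envelope illustration of that point (my own arithmetic,
not anything from the thread's benchmarks), the worst-case
normal-approximation margin of error at 95% confidence depends only on
the sample size n, with no term for the population size:

#include <math.h>
#include <stdio.h>

/* Worst-case (p = 0.5) margin of error at 95% confidence for a simple
 * random sample of size n, ignoring the finite-population correction.
 * Illustrative arithmetic only. */
static double
margin_of_error(double n)
{
    return 1.96 * sqrt(0.25 / n);
}

int
main(void)
{
    double sizes[] = {300, 3000, 30000};

    for (int i = 0; i < 3; i++)
        printf("n = %6.0f  ->  +/- %.1f%%\n",
               sizes[i], 100.0 * margin_of_error(sizes[i]));
    /* roughly +/-5.7%, +/-1.8%, +/-0.6%, regardless of table size */
    return 0;
}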
On Thu, Dec 11, 2008 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Robert Haas" <robertmhaas@gmail.com> writes: >> I had this idle thought too, but I didn't write it down because... > >>> ought to be, but it seems like it ought to be possible to determine >>> that given a desired maximum error in the overall estimate. I'm also >>> not very clear on what the "total frequency" computations (matchfreq2 >>> and unmatchfreq2 in the current code) ought to look like if we are using >>> a variable subset of the inner list. > >> ...of this exact concern, which I think is an insurmountable problem. > > Maybe so. If we stick to the other design (end both lists at a preset > frequency threshold) then the math clearly goes through the same as > before, just with num_mcvs that are determined differently. But can > we prove anything about the maximum error added from that? I don't think so, because in that design, it's entirely possible that you'll throw away the entire MCV list if all of the entries are below the threshold (as in the example we were just benchmarking, supposing a threshold of 0.001). An alternative is to pick a threshold T for the maximum number of equality probes that you're willing to suffer through. Then given two MCV lists of lengths M1 and M2 such that M1 * M2 > T, pick the largest N such that MIN(M1, N) * MIN(M2, N) <= T. This guarantees that you always use at least T^(1/2) MCVs. If you compare this approach with T = 10^6 vs. simply chopping off the MCV list at p = 0.001, this approach will be more accurate at the cost of more comparisons. For example in our test case where all the comparisons fail, chopping off the MCV list at p = 0.001 results in ignoring it completely, whereas with this approach we use all 1000 entries just as before. So it might be appropriate to choose a lower threshold like, say, T = 10^5, since otherwise we're not preventing any computation. (I suppose you could even make this a GUC since any choice of value is going to be pretty arbitrary...) I'm not sure to what extent we can bound the amount of error with this approach, but it's definitely better. As we've seen, a frequency cutoff can throw away the entire MCV list; this approach always retains at least T^(1/2) entries, and more if the other list happens to be shorter than T^(1/2). So we can say that the result will never be worse than it would have been had you set the statistics target to T^(1/2). ...Robert
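A literal-minded sketch of the clamp Robert proposes, in C (illustrative
only; the linear search is obviously not how one would really do it):

/* Given MCV list lengths m1 and m2 and a budget of at most t equality
 * probes, find the largest n such that Min(m1, n) * Min(m2, n) <= t.
 * Sketch of the proposal above, not committed code. */
static int
clamp_mcv_lists(int m1, int m2, int t)
{
    int n = (m1 > m2) ? m1 : m2;    /* no point going higher than this */

    while (n > 1)
    {
        int eff1 = (m1 < n) ? m1 : n;
        int eff2 = (m2 < n) ? m2 : n;

        if (eff1 * eff2 <= t)
            break;
        n--;
    }
    return n;
}

With m1 = m2 = 1000 and t = 10^6 this returns 1000 (no clamping, as in
the benchmark case above); with t = 10^5 it returns 316, i.e. roughly
sqrt(t) entries from each list.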
"Robert Haas" <robertmhaas@gmail.com> writes: > On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> The existing sampling mechanism is tied to solid statistics. >> >> Sounds great, but its not true. The sample size is not linked to data >> volume, so how can it possibly give a consistent confidence range? > It is a pretty well-known mathematical fact that for something like an > opinion poll your margin of error does not depend on the size of the > population but only on the size of your sample. Right. The solid math that Greg referred to concerns how big a sample we need in order to have good confidence in the histogram results. It doesn't speak to whether we get good results for ndistinct (or for most-common-values, though in practice that seems to work fairly well). AFAICS, marginal enlargements in the sample size aren't going to help much for ndistinct --- you really need to look at most or all of the table to be guaranteed anything about that. But having said that, I have wondered whether we should consider allowing the sample to grow to fill maintenance_work_mem, rather than making it a predetermined number of rows. One difficulty is that the random-sampling code assumes it has a predetermined rowcount target; I haven't looked at whether that'd be easy to change or whether we'd need a whole new sampling algorithm. regards, tom lane
On Fri, Dec 12, 2008 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > AFAICS, marginal enlargements in the sample size aren't going to help > much for ndistinct --- you really need to look at most or all of the > table to be guaranteed anything about that. Well you only need to maintain a fixed percentage of the table if by "guaranteed anything" you mean guaranteed a consistent level of confidence. But even a small percentage like 1% means a very different behaviour than currently. For large tables it could mean sampling a *lot* more. However if by "guaranteed anything" you mean guaranteeing an actual useful result then it's true. Even samples as large as 50% give a pretty low confidence estimate. > But having said that, I have wondered whether we should consider > allowing the sample to grow to fill maintenance_work_mem Hm, so I wonder what this does to the time analyze takes. I think it would be the only thing where raising maintenance_work_mem would actually increase the amount of time an operation takes. Generally people raise it to speed up index builds and vacuums etc. -- greg
"Robert Haas" <robertmhaas@gmail.com> writes: > On Thu, Dec 11, 2008 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Maybe so. If we stick to the other design (end both lists at a preset >> frequency threshold) then the math clearly goes through the same as >> before, just with num_mcvs that are determined differently. But can >> we prove anything about the maximum error added from that? > I don't think so, because in that design, it's entirely possible that > you'll throw away the entire MCV list if all of the entries are below > the threshold (as in the example we were just benchmarking, supposing > a threshold of 0.001). Right, but the question is how much that really hurts. It's not like we are going to pick a completely clueless number for the ignored MCVs; rather, we are going to assume that they have the same stats as the remainder of the population. If the threshold frequency isn't very large then the error involved should be bounded. As an example, in the perfectly flat distribution set up by the speed tests we were just doing, there actually wouldn't be any error at all (assuming we got ndistinct right, which of course is a pretty big assumption). I haven't consumed enough caffeine yet to try to do the math, but I think that if you set the threshold as something a bit more than the assumed frequency of a non-MCV value then it could work. > An alternative is to pick a threshold T for the maximum number of > equality probes that you're willing to suffer through. I'd like to get there from the other direction, ie figure out what T has to be to get known maximum error. regards, tom lane
On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote: > On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> The existing sampling mechanism is tied to solid statistics. It > >> provides the correct sample size to get a consistent confidence range > >> for range queries. This is the same mathematics which governs election > >> polling and other surveys. The sample size you need to get +/- 5% 19 > >> times out of 20 increases as the population increases, but not by very > >> much. > > > > Sounds great, but its not true. The sample size is not linked to data > > volume, so how can it possibly give a consistent confidence range? > > I'm not 100% sure how relevant it is to this case, but I think what > Greg is referring to is: > > http://en.wikipedia.org/wiki/Margin_of_error#Effect_of_population_size > > It is a pretty well-known mathematical fact that for something like an > opinion poll your margin of error does not depend on the size of the > population but only on the size of your sample. Yes, I agree with that *but* that refers to population statistics and has nothing at all to do with the calculation of ndistinct, which is what we were discussing. You can't just switch topics and have the statement remain true. ndistinct estimation is improved by larger sample sizes, that's what the maths says and what experimentation shows also. Note that the estimator we use was shown to be stable in the range of sample size between 5-20%. http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf We currently use a sample size of 300*stats_target. With default=10 that means our sample size is 0.3% on a 1 million row table, and 0.003% on a 100 million row table (they're common, I find). That size of sample is OK for some kinds of statistics, but not much good for ndistinct estimation. These issues only show up in the field, they never show up on optimizer test platforms because they typically are several orders of magnitude too small. (Conversely, the stats system works very well indeed for smaller tables... so I'm not trying to debunk everything). -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2008-12-12 at 09:35 -0500, Tom Lane wrote: > AFAICS, marginal enlargements in the sample size aren't going to help > much for ndistinct --- you really need to look at most or all of the > table to be guaranteed anything about that. > > But having said that, I have wondered whether we should consider > allowing the sample to grow to fill maintenance_work_mem, rather than > making it a predetermined number of rows. One difficulty is that the > random-sampling code assumes it has a predetermined rowcount target; > I haven't looked at whether that'd be easy to change or whether we'd > need a whole new sampling algorithm. I think we need to do block sampling before we increase sample size. On large tables we currently do one I/O per sampled row, so the I/O cost of ANALYZE would just increase linearly. We need the increased sample size for ndistinct, not for other stats. So I would suggest we harvest a larger sample, use that for ndistinct estimation, but then sample-the-sample to minimise processing time for other stats that aren't as sensitive as ndistinct. Currently we fail badly on columns that have been CLUSTERed and we can improve that significantly by looking at adjacent groups of rows, i.e. block sampling. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
"Greg Stark" <stark@enterprisedb.com> writes: > On Fri, Dec 12, 2008 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But having said that, I have wondered whether we should consider >> allowing the sample to grow to fill maintenance_work_mem > Hm, so I wonder what this does to the time analyze takes. I think it > would be the only thing where raising maintenance_work_mem would > actually increase the amount of time an operation takes. Generally > people raise it to speed up index builds and vacuums etc. Yeah --- we might need to make it a separate GUC knob instead of tying it directly to maintenance_work_mem. But still, is *any* fixed-size sample really going to help much for large tables? regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote: >> It is a pretty well-known mathematical fact that for something like an >> opinion poll your margin of error does not depend on the size of the >> population but only on the size of your sample. > > Yes, I agree with that *but* that refers to population statistics and > has nothing at all to do with the calculation of ndistinct, which is > what we were discussing. You can't just switch topics and have the > statement remain true. If you go back to my email that was kind of my point. The existing sample size is on a solid foundation for the histograms and most use cases for the statistics. But entirely bogus for ndistinct. The ndistinct estimate is just piggy-backing on that data. However to fix it would require switching over to scanning a percentage of the whole table which would be a massive increase in work for that one calculation. You can't fix it by just adjusting the sample size slightly. > Note that the estimator we use was shown to be stable in the range of > sample size between 5-20%. > http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf Uhm, this is a survey of lots of different methods and does lots of analysis. I don't see any simple conclusions about stability. Perhaps I'm just missing it in the technical details. Could you point out exactly what part of the paper you're basing this on and what "stable" means? > We currently use a sample size of 300*stats_target. With default=10 that > means our sample size is 0.3% on a 1 million row table, and 0.003% on a > 100 million row table (they're common, I find). > > That size of sample is OK for some kinds of statistics, but not much > good for ndistinct estimation. Right, but increasing our sample size by a factor of 150 for a 100M row table doesn't seem like a reasonable solution to one metric being bogus. For that matter, if we do consider sampling 5% of the table we may as well just go ahead and scan the whole table. It wouldn't take much longer and it would actually produce good estimates. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark <stark@enterprisedb.com> writes: > For that matter, if we do consider sampling 5% of the table we may as well > just go ahead and scan the whole table. It wouldn't take much longer and it > would actually produce good estimates. Yeah. Anything over a small fraction of a percent is going to imply fetching every page anyway, for typical row widths. If you want ANALYZE to be cheap then you simply don't get to have a trustworthy value of ndistinct. Perhaps a better plan is to try to de-emphasize use of ndistinct, though I concede I have no idea how to do that. regards, tom lane
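To put rough numbers on "anything over a small fraction of a percent
implies fetching every page anyway": assuming about 100 rows per page
and uniformly scattered sample rows, the expected fraction of pages
touched is 1 - (1 - f)^100 for a row sampling fraction f. A quick
illustration (my own arithmetic, with the 100-rows-per-page figure
assumed):

#include <math.h>
#include <stdio.h>

/* Expected fraction of pages containing at least one sampled row, if
 * rows are scattered uniformly.  Illustrative only. */
int
main(void)
{
    double rows_per_page = 100.0;   /* assumed typical row width */
    double fractions[] = {0.001, 0.005, 0.01, 0.05};

    for (int i = 0; i < 4; i++)
        printf("sample %.1f%% of rows -> touch ~%.0f%% of pages\n",
               100.0 * fractions[i],
               100.0 * (1.0 - pow(1.0 - fractions[i], rows_per_page)));
    /* roughly 10%, 39%, 63% and 99% of pages, respectively */
    return 0;
}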
>>> "Nathan Boley" <npboley@gmail.com> wrote: > Can anyone suggest a good data set to test this sort of question on? Where we have the biggest problem with bad estimates is on complex searches involving many joins where the main criterion limiting the result set is a name. The estimate based on the histogram is often very low (e.g. 2) when the actual result set is several hundred. While several hundred is far short of 1% of the table, the best plan for a result set of that size is very different than the best plan for two rows. Some numbers follow to give an idea of the shape of data where current techniques sometimes do poorly. We use a "searchName" column which puts the name components from various columns into a canonical format; this is what is indexed and searched. The search is usually a LIKE with the high order portion being six to ten characters followed by the wild card. Total rows in table: 32,384,830 There are 9,958,969 distinct values. There is one value present in over 1% of the rows, with 433,578 rows. There are ten values present in over 0.1% of the rows:433578140398135489112088 64069 63158 44656 36499 35896 35819 The 100th most common value is present in 4847 rows. There are 186 rows with over 0.01% of the rows. Based on my experience, we would need better estimates for ranges with 200 to 300 rows to improve our plans for the problem cases. I'd be happy to have it scan the whole table during our nightly VACUUM ANALYZE if that would get me statistics which would improve the estimates to that degree without a huge increase in plan time. Which raises the issue, if we could get better statistics by passing the whole table, why not do that when VACUUM ANALYZE is run? -Kevin
> Which raises the issue, if we could get better statistics by passing
> the whole table, why not do that when VACUUM ANALYZE is run?

I think the reason is "because the next autovacuum would undo it".

Perhaps a table-level option to scan the whole table instead of
estimating would be appropriate?

...Robert
On Fri, 2008-12-12 at 16:10 +0000, Gregory Stark wrote: > Right, but increasing our sample size by a factor of 150 for a 100M > row table doesn't seem like a reasonable solution to one metric being > bogus. > > For that matter, if we do consider sampling 5% of the table we may as > well just go ahead and scan the whole table. It wouldn't take much > longer and it would actually produce good estimates. As I said, we would only increase sample for ndistinct, not for others. At the moment we completely and significantly fail to assess ndistinct correctly on clustered data for large tables. Using block level sampling would prevent that. Right now we may as well use a random number generator. The amount of I/O could stay the same, just sample all rows on block. Lifting the sample size will help large tables. Will it be perfect? No. But I'll take "better" over "not working at all". If we are going to quote literature we should believe all the literature. We can't just listen to some people that did a few tests with sample size, but then ignore the guy that designed the MS optimizer and many others. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2008-12-12 at 11:16 -0500, Tom Lane wrote:
> If you want ANALYZE to be cheap then you simply don't get to have
> a trustworthy value of ndistinct.

Understood, but a cheap ANALYZE isn't always a higher priority than all
other considerations. Solutions can also include

* allowing user to note that we would actually like to scan the whole
table (stats_target = -2?)
* manual mechanism for setting ndistinct that doesn't keep getting
overwritten by subsequent ANALYZEs
* have the executor do non-transactional update of the value of
ndistinct if it ever builds a hash table that is larger than expected
(simple learning optimizer action)

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-12-12 at 11:16 -0500, Tom Lane wrote:
> Perhaps a better plan is to try to de-emphasize use of ndistinct,
> though I concede I have no idea how to do that.

We don't actually care about the accuracy of the ndistinct much, just
the accuracy of our answer to the question "given work_mem = X, is it
better to use a hash plan". So we just need to scan the table until we
can answer that question accurately enough, i.e. a variable-sized
sample.

Perhaps we could store a probability distribution for various values of
work_mem, rather than a single ndistinct value.

Anyway, definitely handwaving now to stimulate ideas.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Robert Haas wrote:
> > Which raises the issue, if we could get better statistics by passing
> > the whole table, why not do that when VACUUM ANALYZE is run?
>
> I think the reason is "because the next autovacuum would undo it".

Is there any way to "merge" the statistics? i.e. if a full table scan
is done to compute precise statistics, and later a regular analyze scan
is done, then perhaps instead of clobbering the previous stats, you
merge them with the new ones, thus not completely losing those previous
ones.

--
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Tom Lane wrote:
> If you want ANALYZE to be cheap then you simply don't get to have a
> trustworthy value of ndistinct.

But then, maybe it's not all that critical that ANALYZE is cheap. For
example, if we were to rework VACUUM ANALYZE so that on the same pass
that VACUUM cleans each heap page, a callback is called on the page to
grab the needed stats.

Partial vacuum is a roadblock for this though :-(

--
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Simon Riggs <simon@2ndQuadrant.com> writes: > The amount of I/O could stay the same, just sample all rows on block. > Lifting the sample size will help large tables. Will it be perfect? No. > But I'll take "better" over "not working at all". That will just raise the table size at which the problems start. It'll still be a constant-sized sample. It will also introduce strange biases. For instance in a clustered table it'll think there are a lot more duplicates than there really are because it'll see lots of similar values. Incidentally we *do* do block sampling. We pick random blocks and then pick random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime around then. It dramatically reduced the i/o requirements but there were long discussions of how to do it without introducing biases. > If we are going to quote literature we should believe all the > literature. We can't just listen to some people that did a few tests > with sample size, but then ignore the guy that designed the MS optimizer > and many others. I'm not sure what you're talking about regarding "some people that did a few tests". I looked around for the paper I keep referencing and can't find it on my laptop. I'll look for it online. But it included a section which was a survey of past results from other papers and the best results required stupidly large sample sizes to get anything worthwhile. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
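For readers following along, here is a much-simplified sketch of that
two-stage idea (random blocks, then a reservoir over the rows found on
them). This is toy code with invented names; the real implementation is
acquire_sample_rows() in analyze.c and is far more careful about block
selection without replacement and unequal block sizes:

#include <stdlib.h>

/* Hypothetical row reference: block number plus line number. */
typedef struct { int block; int line; } RowRef;

/* Very loose sketch of two-stage sampling: visit a random subset of
 * blocks, and reservoir-sample the rows found on them down to targrows.
 * Illustrative only. */
static int
sample_rows(int nblocks, int rows_per_block, int targrows, RowRef *sample)
{
    int nseen = 0;              /* rows examined so far */
    int nkept = 0;              /* rows currently in the reservoir */

    for (int b = 0; b < nblocks; b++)
    {
        /* stage 1: visit each block with probability targrows/nblocks */
        if (rand() % nblocks >= targrows)
            continue;

        for (int line = 0; line < rows_per_block; line++)
        {
            nseen++;
            if (nkept < targrows)
            {
                sample[nkept].block = b;
                sample[nkept].line = line;
                nkept++;
            }
            else
            {
                /* stage 2: classic reservoir step, replacing a random
                 * slot with probability targrows / nseen */
                int k = rand() % nseen;

                if (k < targrows)
                {
                    sample[k].block = b;
                    sample[k].line = line;
                }
            }
        }
    }
    return nkept;
}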
On Fri, 2008-12-12 at 16:10 +0000, Gregory Stark wrote:
> Uhm, this is a survey of lots of different methods and does lots of
> analysis. I don't see any simple conclusions about stability. Perhaps
> I'm just missing it in the technical details. Could you point out
> exactly what part of the paper you're basing this on and what "stable"
> means?

I was echoing the comments in the ANALYZE code, which explain that we
use the Duj1 estimator because it is more stable across sample size, as
shown in table 5 on p.21 of the Haas-Stokes report.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
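For reference, the Duj1 estimator being discussed is, as far as I can
tell from the comments in analyze.c, the following (restated here as a
throwaway C helper for illustration, not the actual code):

/* Duj1 estimator from Haas & Stokes, as described in the analyze.c
 * comments:
 *   n  = number of sampled rows, N = total rows in the table,
 *   d  = distinct values seen in the sample,
 *   f1 = values seen exactly once in the sample. */
static double
duj1_ndistinct(double n, double N, double d, double f1)
{
    return (n * d) / (n - f1 + f1 * n / N);
}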
>>> "Robert Haas" <robertmhaas@gmail.com> wrote: >> Which raises the issue, if we could get better statistics by passing >> the whole table, why not do that when VACUUM ANALYZE is run? > > I think the reason is "because the next autovacuum would undo it". The table has 32.4 million rows. autovacuum_analyze_scale_factor is 0.1. autovacuum_vacuum_scale_factor is 0.2. We run a nightly VACUUM ANALYZE. Deletes are rare. Normal operations don't update more than a few thousand rows per day. I know that normal operations never cause an autovacuum of this table. Perhaps if there was a way to share this information with PostgreSQL.... -Kevin
On Fri, 2008-12-12 at 17:05 +0000, Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > > The amount of I/O could stay the same, just sample all rows on block. > > Lifting the sample size will help large tables. Will it be perfect? No. > > But I'll take "better" over "not working at all". > > That will just raise the table size at which the problems start. It'll still > be a constant-sized sample. Work with me here. I want to make the situation better. It still won't be perfect, but is that an argument against any action at all? > It will also introduce strange biases. For instance in a clustered table it'll > think there are a lot more duplicates than there really are because it'll see > lots of similar values. > > Incidentally we *do* do block sampling. We pick random blocks and then pick > random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime > around then. It dramatically reduced the i/o requirements but there were long > discussions of how to do it without introducing biases. No, we pick random rows. On bigger tables, they get further apart typically and so we miss any clustering. I mean that we should pick a random block and read all rows on it. > > If we are going to quote literature we should believe all the > > literature. We can't just listen to some people that did a few tests > > with sample size, but then ignore the guy that designed the MS optimizer > > and many others. > > I'm not sure what you're talking about regarding "some people that did a few > tests". I looked around for the paper I keep referencing and can't find it on > my laptop. I'll look for it online. But it included a section which was a > survey of past results from other papers and the best results required > stupidly large sample sizes to get anything worthwhile. Even if you find it, we still need to know why we would listen to the research in the absent paper yet ignore the conclusion in the paper by the man in charge of the MS optimizer who said that block level sampling is a good idea. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> If you want ANALYZE to be cheap then you simply don't get to have a
>> trustworthy value of ndistinct.

> But then, maybe it's not all that critical that ANALYZE is cheap. For
> example, if we were to rework VACUUM ANALYZE so that on the same pass
> that VACUUM cleans each heap page, a callback is called on the page to
> grab the needed stats.

> Partial vacuum is a roadblock for this though :-(

Yeah --- now that partial vacuum is in, any argument that we can make
ANALYZE piggyback on VACUUM cheaply is dead anyway.

It would be interesting to consider "partial analyze" processing, but I
don't see how you would combine per-page partial results without a huge
increase in stats-related state data.

regards, tom lane
On Fri, Dec 12, 2008 at 5:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Incidentally we *do* do block sampling. We pick random blocks and then pick
>> random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime
>> around then. It dramatically reduced the i/o requirements but there were long
>> discussions of how to do it without introducing biases.
>
> No, we pick random rows. On bigger tables, they get further apart
> typically and so we miss any clustering. I mean that we should pick a
> random block and read all rows on it.

I think what's happening here is that our sampling method is based on
the assumption that a record's location is not related to its value.

Consider a table which is clustered and has two copies of each value.
When we look at a block and see n/2 distinct values, and we know there
are 1000 blocks, then a human would conjecture that there are 1000*n/2
distinct values throughout the table and that every value is
represented twice throughout the whole table.

But if we're assuming the records are randomly distributed, then looking
at the whole block will actually throw us off completely. We'll deduce
from the fact that we saw every value twice that there must be hundreds
of copies spread throughout the database, and there must be a lot less
than 1000*n/2 distinct values.

I think you need to find two different formulas, one which represents a
clustered table and one which represents randomly distributed data.
Then you need a way to measure just how clustered the data is so you
know how much weight to give each formula. Perhaps comparing the number
of duplicates in whole-block samples versus overall random selections
would give that measure.

--
greg
On Fri, 2008-12-12 at 14:03 -0300, Alvaro Herrera wrote:
> Partial vacuum is a roadblock for this though :-(

Actually, perhaps it's an enabler for looking at changed stats?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2008-12-12 at 11:16 -0500, Tom Lane wrote: >> Perhaps a better plan is to try to de-emphasize use of ndistinct, >> though I concede I have no idea how to do that. > We don't actually care about the accuracy of the ndistinct much, just > the accuracy of our answer to the question "given work_mem = X, is it > better to use a hash plan". That's hardly the only thing we use ndistinct for. Also, it's a bit too simplistic to suppose that we only have to make the right binary choice between hash and something else at a particular plan level. If we don't have at least ballpark-correct figures for cost and number of output rows, we'll start making mistakes at higher plan levels. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > Is there any way to "merge" the statistics? i.e. if a full table scan > is done to compute precise statistics, and later a regular analyze scan > is done, then perhaps instead of clobbering the previous stats, you > merge them with the new ones, thus not completely losing those previous > ones. Seems like a pretty hard problem unless you store a whole lot more statistics state than we do now (which of course would create its own costs). How would you know which portion of the old stats to not believe anymore? regards, tom lane
Alvaro Herrera wrote:
> Robert Haas wrote:
> > > Which raises the issue, if we could get better statistics by passing
> > > the whole table, why not do that when VACUUM ANALYZE is run?
> >
> > I think the reason is "because the next autovacuum would undo it".
>
> Is there any way to "merge" the statistics? i.e. if a full table scan
> is done to compute precise statistics, and later a regular analyze scan
> is done, then perhaps instead of clobbering the previous stats, you
> merge them with the new ones, thus not completely losing those previous
> ones.

Crazy idea, but if a partial analyze finds that 5% of the table has
changed since the last full analyze, but 10% of the statistics are
different, we know something is wrong. ;-)

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Alvaro Herrera wrote:
> Tom Lane wrote:
> > If you want ANALYZE to be cheap then you simply don't get to have a
> > trustworthy value of ndistinct.
>
> But then, maybe it's not all that critical that ANALYZE is cheap. For
> example, if we were to rework VACUUM ANALYZE so that on the same pass
> that VACUUM cleans each heap page, a callback is called on the page to
> grab the needed stats.
>
> Partial vacuum is a roadblock for this though :-(

Perhaps it isn't, because partial vacuum is going to highlight the
_changed_ blocks, which fits into your idea of merging stats, somehow.
;-)

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Fri, 2008-12-12 at 18:01 +0000, Greg Stark wrote:
> I think you need to find two different formulas, one which represents
> a clustered table and one which represents randomly distributed data.
> Then you need a way to measure just how clustered the data is so you
> know how much weight to give each formula. Perhaps comparing the
> number of duplicates in whole-block samples versus overall random
> selections would give that measure.

Please read the Chaudhuri paper.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > As I said, we would only increase sample for ndistinct, not for others. How will you do that? Keep in mind that one of the things we have to do to compute ndistinct is to sort the sample. ISTM that the majority of the cost of a larger sample is going to get expended anyway --- certainly we could form the histogram using the more accurate data at precisely zero extra cost, and I think we have also pretty much done all the work for MCV collection by the time we finish counting the number of distinct values. I seem to recall Greg suggesting that there were ways to estimate ndistinct without sorting, but short of a fundamental algorithm change there's not going to be a win here. > Right now we may as well use a random number generator. Could we skip the hyperbole please? regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > Solutions can also include > * manual mechanism for setting ndistinct that doesn't keep getting > overwritten by subsequent ANALYZEs Hmm, that might actually be the most practical answer for large, reasonably-static tables. Especially if we expose the "negative stadistinct" feature to let people specify it as a fraction of table size. regards, tom lane
On Fri, 2008-12-12 at 13:10 -0500, Tom Lane wrote:
> If we don't
> have at least ballpark-correct figures for cost and number of output
> rows, we'll start making mistakes at higher plan levels.

That's definitely where we are now. Had a call with a customer today
where I said "it's OK, it's only out by 3 orders of magnitude, it gets
worse than that". A 9-hour query, with a substantially incorrect
EXPLAIN.

I don't claim to have any/all of the answers myself, but this needs
attention very badly.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-12-12 at 13:18 -0500, Tom Lane wrote:
> I seem to recall Greg suggesting that there were ways to estimate
> ndistinct without sorting, but short of a fundamental algorithm change
> there's not going to be a win here.

Hash table? Haas and Stokes suggest a Bloom filter.

Why not keep the random algorithm we have now, but scan the block into a
separate hash table for ndistinct estimation. That way we keep the
correct random rows for other purposes.

> > Right now we may as well use a random number generator.
>
> Could we skip the hyperbole please?

Some of the ndistinct values are very badly off, and in the common cases
I cited previously, consistently so.

Once I'm certain the rescue helicopter has seen me, I'll stop waving my
arms. (But yes, OK).

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2008-12-12 at 13:18 -0500, Tom Lane wrote: >> Could we skip the hyperbole please? > Some of the ndistinct values are very badly off, and in the common cases > I cited previously, consistently so. > Once I'm certain the rescue helicopter has seen me, I'll stop waving my > arms. (But yes, OK). Well, AFAICT we have agreed in this thread to kick up the default and maximum stats targets by a factor of 10 for 8.4. If there's anything to your thesis that a bigger sample size will help, that should already make a noticeable difference. I'm inclined to wait for some field experience with 8.4 before we start making fundamental changes to the ANALYZE code. regards, tom lane
Robert Haas wrote:
>> Which raises the issue, if we could get better statistics by passing
>> the whole table, why not do that when VACUUM ANALYZE is run?
>
> I think the reason is "because the next autovacuum would undo it".

OK, but even if autovacuum will undo it, an almost-static data set would
benefit from this feature because (i) autovacuum won't often trigger on
that table, or (ii) you have disabled autovacuum on that table.

> Perhaps a table-level option to scan the whole table instead of
> estimating would be appropriate?

ANALYZE FULL foo ?

--
Euler Taveira de Oliveira
http://www.timbira.com/
>> Perhaps a table-level option to scan the whole table instead of >> estimating would be appropriate? >> > ANALYZE FULL foo ? I was thinking more like a flag that you could set on the table itself, that would apply to all future analyzes. ...Robert
On Fri, Dec 12, 2008 at 6:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I seem to recall Greg suggesting that there were ways to estimate > ndistinct without sorting, but short of a fundamental algorithm change > there's not going to be a win here. Not sure if you meant me, but I can say the approach I saw documented before involved keeping a hash of all values seen in a full table scan. When the hash reached a maximum size you discard half the hash key space and from then on ignore any values which hash into that space. When you reach the end you have a hash table with a random sample of 1/2^n distinct values (where n is the number of times the hash table overflowed) and the counts for those values. If you're just interested in counting distinct values in the sample I suppose you could do it using a Bloom filter just as we've talked about for other hash tables. You scan through the list and increment an ndistinct counter for every value you find where the bits aren't all already set (then set those bits). That would undercount slightly and I'm not sure how much faster than qsort it would actually be. The log(n) factor in qsort's average case is pretty small when we're talking about small samples. -- greg
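A toy sketch of that second, Bloom-filter variant (filter size and hash
functions are invented for illustration; a real version would size the
filter from the sample size and use the datatype's own hash support):

#include <stdint.h>
#include <stdlib.h>

/* Toy sketch of counting (approximately) the number of distinct values
 * in a sample with a Bloom filter instead of sorting.  Increments the
 * counter only when a value's bits were not all already set, so it
 * undercounts slightly due to false positives. */
#define FILTER_BITS (1 << 20)

static int
bloom_ndistinct(const uint32_t *sample, int nrows)
{
    unsigned char *bits = calloc(FILTER_BITS / 8, 1);
    int            ndistinct = 0;

    for (int i = 0; i < nrows; i++)
    {
        /* two cheap, made-up hash positions derived from the value */
        uint32_t h1 = (sample[i] * 2654435761u) % FILTER_BITS;
        uint32_t h2 = (sample[i] * 40503u + 2166136261u) % FILTER_BITS;
        int      seen = (bits[h1 / 8] & (1 << (h1 % 8))) &&
                        (bits[h2 / 8] & (1 << (h2 % 8)));

        if (!seen)
            ndistinct++;    /* first time we (probably) saw this value */
        bits[h1 / 8] |= 1 << (h1 % 8);
        bits[h2 / 8] |= 1 << (h2 % 8);
    }
    free(bits);
    return ndistinct;
}

Whether the constant factor beats qsort on a sample of a few hundred
thousand rows is exactly the open question Greg raises.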
On Fri, Dec 12, 2008 at 6:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Why not keep the random algorithm we have now, but scan the block into a
> separate hash table for ndistinct estimation. That way we keep the
> correct random rows for other purposes.

It seems to me that what you have to do is look at a set of blocks and
judge a) how many duplicates are in the typical block and b) how much
overlap there is between blocks. Then extrapolate to other blocks based
on those two values.

So for example if you look at 1% of the blocks and find there are 27
distinct values on each of the blocks, then you extrapolate that there
are somewhere between 100*27 distinct values table-wide (if those blocks
have no intersections) and 27 distinct values total (if those blocks
intersect 100% -- ie, they all have the same 27 distinct values).

I haven't had a chance to read that paper, it looks extremely dense. Is
this the same idea?

--
greg
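A toy sketch of the two-extreme blend Greg describes, with an invented
linear weighting; finding the right clusteredness measure and weighting
rule is exactly the hard part:

/* d_sample is the number of distinct values actually seen across the
 * sampled blocks, sample_fraction is the fraction of the table's blocks
 * that were read, and clusteredness in [0,1] says how much the sampled
 * blocks' value sets overlapped (1 = every block holds the same values,
 * 0 = no overlap at all).  The linear blend is made up for
 * illustration. */
static double
blended_ndistinct(double d_sample, double sample_fraction,
                  double clusteredness)
{
    double no_overlap_est   = d_sample / sample_fraction; /* disjoint blocks */
    double full_overlap_est = d_sample;                   /* identical blocks */

    return clusteredness * full_overlap_est +
           (1.0 - clusteredness) * no_overlap_est;
}

With d_sample = 27 and sample_fraction = 0.01, the result ranges between
27 and 100*27 as clusteredness goes from 1 to 0, matching the two
extremes in the example above.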
On Fri, 2008-12-12 at 13:43 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Fri, 2008-12-12 at 13:18 -0500, Tom Lane wrote: > >> Could we skip the hyperbole please? > > > Some of the ndistinct values are very badly off, and in the common cases > > I cited previously, consistently so. > > > Once I'm certain the rescue helicopter has seen me, I'll stop waving my > > arms. (But yes, OK). > > Well, AFAICT we have agreed in this thread to kick up the default and > maximum stats targets by a factor of 10 for 8.4. If there's anything > to your thesis that a bigger sample size will help, that should already > make a noticeable difference. That only makes x10 sample size. Since we're using such a low sample size already, it won't make much difference to ndistinct. It will be great for histograms and MCVs though. Please review my detailed test results mentioned here http://archives.postgresql.org/pgsql-hackers/2006-01/msg00153.php If you reproduce those results you'll see that the ndistinct machinery is fundamentally broken for clustered data on large tables. In many cases those are join keys and so joins are badly handled on the very tables where good optimisation is most important. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> The amount of I/O could stay the same, just sample all rows on block. [....] > > It will also introduce strange biases. For instance in a clustered table it'll > think there are a lot more duplicates than there really are because it'll see > lots of similar values. But for ndistinct - it seems it could only help things. If the ndistinct guesser just picks max(the-current-one-row-per-block-guess, a-guess-based-on-all-the-rows-on-the-blocks) it seems we'd be no worse off for clustered tables; and much better off for randomly organized tables. In some ways I fear *not* sampling all rows on the block also introduces strange biases by largely overlooking the fact that the table's clustered. In my tables clustered on zip-code we don't notice info like "state='AZ' is present in well under 1% of blocks in the table", while if we did scan all rows on the blocks it might guess this. But I guess a histogram of blocks would be additional stat rather than an improved one.
On Fri, 2008-12-12 at 13:20 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > Solutions can also include
> > * manual mechanism for setting ndistinct that doesn't keep getting
> > overwritten by subsequent ANALYZEs
>
> Hmm, that might actually be the most practical answer for large,
> reasonably-static tables. Especially if we expose the "negative
> stadistinct" feature to let people specify it as a fraction of table
> size.

Works for me. Especially if you want to think more about ANALYZE before
changing that.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2008-12-12 at 13:20 -0500, Tom Lane wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >>> Solutions can also include >>> * manual mechanism for setting ndistinct that doesn't keep getting >>> overwritten by subsequent ANALYZEs >> >> Hmm, that might actually be the most practical answer for large, >> reasonably-static tables. Especially if we expose the "negative >> stadistinct" feature to let people specify it as a fraction of table >> size. > Works for me. Especially if you want to think more about ANALYZE before > changing that. Well, it's something that would be sane to contemplate adding in 8.4. It's way too late for any of this other stuff to happen in this release. regards, tom lane
On Thu, 11 Dec 2008, Tom Lane wrote:
> On the whole I think we have some evidence here to say that upping the
> default value of default_stats_target to 100 wouldn't be out of line,
> but 1000 definitely is. Comments?

Circling back to where this started from now that the discussion has
slowed down here, the original tuning suggestions Josh threw out were:

> Mixed:
> default_statistics_target = 100
> DW:
> default_statistics_target = 400

I think this data, suggesting that the serious quadratic behavior
doesn't kick in until much higher values than originally theorized,
supports increasing the mixed number to 100. And if the database default
goes there too, I think that's even better. Compared to the current
default of 10, that makes for a 4X increase in planning time for
Robert's original test case, and a 5.8X increase in the more difficult
array test Tom came up with.

The only question really open in my mind is whether 400 is still
considered too high for a DW application. Relative to 100, Tom's numbers
suggest that a further 4X increase ends up turning into as much as an 8X
increase in planning time. Given the context of a DW application, where
bad plans are really expensive, that suggestion of 400 seems confirmed
to be sane now to me.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Fri, Dec 12, 2008 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Fri, 2008-12-12 at 13:20 -0500, Tom Lane wrote: >>> Simon Riggs <simon@2ndQuadrant.com> writes: >>>> Solutions can also include >>>> * manual mechanism for setting ndistinct that doesn't keep getting >>>> overwritten by subsequent ANALYZEs >>> >>> Hmm, that might actually be the most practical answer for large, >>> reasonably-static tables. Especially if we expose the "negative >>> stadistinct" feature to let people specify it as a fraction of table >>> size. > >> Works for me. Especially if you want to think more about ANALYZE before >> changing that. > > Well, it's something that would be sane to contemplate adding in 8.4. > It's way too late for any of this other stuff to happen in this release. I'm thinking about trying to implement this, unless someone else is already planning to do it. I'm not sure it's practical to think about getting this into 8.4 at this point, but it's worth doing whether it does or not. The main problem I see here is that I don't know what the mechanism should be. My first thought was to use a reloption, but that won't work because reloptions are per-table and this setting is per-column. So my second thought is to add a column to pg_attribute and add a new DDL commands to modify it, maybe something like: ALTER TABLE name ALTER [COLUMN] column SET NDISTINCT 1672; ALTER TABLE name ALTER [COLUMN] column DROP NDISTINCT; Another option would be to invent a reloption-like syntax for columns: ALTER TABLE name ALTER [COLUMN] column SET (ndistinct = 1672); ALTER TABLE name ALTER [COLUMN] column RESET (ndistinct); This sort of seems cleaner and I can imagine it being useful for other purposes, but I'm not sure if I want to go to the trouble of building a completely general column-level reloptions mechanism ATM. It's also not entirely clear to me whether this option, wherever we put it, should be directly examined by the selectivity functions before consulting the statistics tuple, or whether it should merely override the value that ANALYZE puts into the statistics tuple. The latter seems simpler and more performant to me, but does lead to the possibly surprising result that changes don't take effect until the next ANALYZE. Thoughts? ...Robert
Robert Haas <robertmhaas@gmail.com> wrote:
> >> Works for me. Especially if you want to think more about ANALYZE before
> >> changing that.
> >
> > Well, it's something that would be sane to contemplate adding in 8.4.
> > It's way too late for any of this other stuff to happen in this release.
>
> I'm thinking about trying to implement this, unless someone else is
> already planning to do it. I'm not sure it's practical to think about
> getting this into 8.4 at this point, but it's worth doing whether it
> does or not.

Can we use get_relation_stats_hook on 8.4? The pg_statistic catalog
will be still modified by ANALYZEs, but we can rewrite the statistics
just before they are used. Something like this (sketch only):

your_relation_stats_hook(root, rte, attnum, vardata)
{
    /* call default implementation first to fill in vardata */

    if (rte->relid == YourRelation && attnum == YourColumn)
        ((Form_pg_statistic) GETSTRUCT(vardata->statsTuple))->stadistinct =
            YourNDistinct;
}

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
On Thu, Mar 19, 2009 at 4:04 AM, ITAGAKI Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> >> Works for me. Especially if you want to think more about ANALYZE before
>> >> changing that.
>> >
>> > Well, it's something that would be sane to contemplate adding in 8.4.
>> > It's way too late for any of this other stuff to happen in this release.
>>
>> I'm thinking about trying to implement this, unless someone else is
>> already planning to do it. I'm not sure it's practical to think about
>> getting this into 8.4 at this point, but it's worth doing whether it
>> does or not.
>
> Can we use get_relation_stats_hook on 8.4? The pg_statistic catalog
> will be still modified by ANALYZEs, but we can rewrite the statistics
> just before they are used. Something like this (sketch only):
>
> your_relation_stats_hook(root, rte, attnum, vardata)
> {
>     /* call default implementation first to fill in vardata */
>
>     if (rte->relid == YourRelation && attnum == YourColumn)
>         ((Form_pg_statistic) GETSTRUCT(vardata->statsTuple))->stadistinct =
>             YourNDistinct;
> }

I don't know, can you run a query from inside the stats hook? It sounds
like this could be made to work for a hard-coded relation and column,
but ideally you'd like to get this data out of a table somewhere.

I started implementing this by adding attdistinct to pg_attribute and
making it a float8, with 0 meaning "don't override the results of the
normal stats computation" and any other value meaning "override the
results of the normal stats computation with this value". I'm not sure,
however, whether I can count on the result of an equality test against a
floating-point zero to be reliable on every platform. It also seems like
something of a waste of space, since the only positive values that are
useful are integers (and presumably less than 2^31-1) and the only
negative values that are useful are > -1.

So I'm thinking about making it an integer, to be interpreted as
follows:

0 => compute ndistinct normally
positive value => use this value for ndistinct
negative value => use this value * 10^-6 for ndistinct

Any thoughts?

...Robert
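To make the proposed integer encoding concrete, here is a small sketch of
how it might be interpreted when overriding stadistinct. This is
hypothetical code for a proposal, not anything committed; it relies only
on the existing pg_statistic convention that a negative stadistinct means
minus-a-fraction-of-the-table's-rows:

/* Sketch of decoding the proposed integer attdistinct override.
 * computed_stadistinct is whatever ANALYZE would otherwise have stored. */
static double
attdistinct_to_stadistinct(int attdistinct, double computed_stadistinct)
{
    if (attdistinct == 0)
        return computed_stadistinct;    /* no override: use ANALYZE's result */
    if (attdistinct > 0)
        return (double) attdistinct;    /* absolute number of distinct values */
    return attdistinct * 1e-6;          /* negative: fraction of rows, scaled */
}

For example, an attdistinct of -500000 would decode to -0.5, meaning
"assume half the rows are distinct" under the existing negative
stadistinct convention.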