Thread: [GENERAL] how to get accurate values in pg_statistic (continued)

[GENERAL] how to get accurate values in pg_statistic (continued)

From
Mary Edie Meredith
Date:
Our port of OSDL DBT3 test suite to PostgreSQL (see Background
information below) is nearing completion.  We would also like to confirm
our understanding of an outstanding consistency issue.

We have not been able to do meaningful kernel testing since the runs
(all parameters/kernels being equal) are wildly varying - sometimes
20-25% differences in the metrics from run to run.

We found plans were changing from test run to test run. In one case a
plan ran 20 minutes in the throughput test of one run, and 2 seconds in
another run!  By forcing the contents of pg_statistic to be the same
before the queries run, we now have consistent results. So we know for
sure the problem is due to the random nature of the stats sampling: the
optimizer always saw different stats data, resulting in different plans.

Stephan Szabo kindly responded to our earlier queries suggesting we look
at default_statistics_target and ALTER TABLE ALTER COLUMN SET
STATISTICS.

These determine the number of bins in the histogram for a given column.
But for a large number of rows (for example 6 million) the maximum value
(1000) does not guarantee that ANALYZE will do a full scan of the table.
We do not see a way to guarantee the same statistics run to run without
forcing ANALYZE to examine every row of every table.
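For reference, the knobs Stephan pointed us at look roughly like this
(the table and column names here are just examples from the DBT-3
schema, not a recommendation):

    -- raise the histogram size for one column, then re-gather stats
    ALTER TABLE lineitem ALTER COLUMN l_quantity SET STATISTICS 1000;
    ANALYZE lineitem;

    -- or raise it for everything analyzed in this session
    SET default_statistics_target = 1000;
    ANALYZE;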

Are we wrong in our analysis?

Are there mainstream alternatives we have missed?

How do you do testing on large tables and make the execution plans
consistent?

Is there a change to ANALYZE in 7.4 that solves our problem?

TIA.


********************************************************************
Background information:

Database Test 3 (DBT-3) is a decision support workload.

The test kit itself has been executing on PostgreSQL for some time, is
available on sourceforge, and is implemented on our Scalable Test
Platform (STP).


A bit of background:  The test
(1) builds a database from load files and gathers statistics,
(2) runs a single stream of the 22 queries plus a set of inserts and deletes
(the power test), then
(3) runs multiple streams of the queries with one added stream of
inserts/deletes (the throughput test).


--
Mary Edie Meredith <maryedie@osdl.org>
Open Source Development Lab


Re: [GENERAL] how to get accurate values in pg_statistic

From
Rod Taylor
Date:
On Thu, 2003-09-04 at 13:41, Mary Edie Meredith wrote:
> Our port of OSDL DBT3 test suite to PostgreSQL (see Background
> information below) is nearing completion.  We would also like to confirm
> our understanding of an outstanding consistency issue.
>
> We have not been able to do meaningful kernel testing since the runs
> (all parameters/kernels being equal) are wildly varying - sometimes
> 20-25% differences in the metrics run to run.

Run a VACUUM FULL ANALYZE between runs.  This will force a full scan of
all data for stats, as well as ensure the table is consistently
compacted.
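
That is, something along these lines between test runs (the per-table
form is only an illustration; lineitem is one of the DBT-3 tables):

    VACUUM FULL ANALYZE;
    -- or, if doing the whole database is too slow, per table:
    VACUUM FULL ANALYZE lineitem;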


Re: [GENERAL] how to get accurate values in pg_statistic (continued)

From
Tom Lane
Date:
Mary Edie Meredith <maryedie@osdl.org> writes:
> Stephan Szabo kindly responded to our earlier queries suggesting we look
> at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> STATISTICS.

> These determine the number of bins in the histogram for a given column.
> But for a large number of rows (for example 6 million) the maximum value
> (1000) does not guarantee that ANALYZE will do a full scan of the table.
> We do not see a way to guarantee the same statistics run to run without
> forcing ANALYZE to examine every row of every table.

Do you actually still have a problem with the plans changing when the
stats target is above 100 or so?  I think the notion of "force ANALYZE
to do a full scan" is inherently wrongheaded ... it certainly would not
produce numbers that have anything to do with ordinary practice.

If you have data statistics that are so bizarre that the planner still
gets things wrong with a target of 1000, then I'd like to know more
about why.

            regards, tom lane

Re: [GENERAL] how to get accurate values in pg_statistic

From
Neil Conway
Date:
On Thu, 2003-09-04 at 13:46, Rod Taylor wrote:
> Run a VACUUM FULL ANALYZE between runs.  This will force a full scan of
> all data for stats

It will? Are you sure about that?

-Neil



Re: [GENERAL] how to get accurate values in pg_statistic

From
Mary Edie Meredith
Date:
The documentation led us to believe that it would not, but we are
testing just the same (at least checking that pg_statistic is the
same after each load and VACUUM FULL ANALYZE).  Will report back.
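
A dump along these lines from each run, diffed afterwards, is the sort
of check we mean (the column list is just illustrative):

    -- snapshot the sampled statistics so two runs can be compared
    SELECT starelid, staattnum, stanullfrac, stadistinct,
           stanumbers1, stavalues1
      FROM pg_statistic
     ORDER BY starelid, staattnum;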



On Thu, 2003-09-04 at 16:50, Neil Conway wrote:
> On Thu, 2003-09-04 at 13:46, Rod Taylor wrote:
> > Run a VACUUM FULL ANALYZE between runs.  This will force a full scan of
> > all data for stats
>
> It will? Are you sure about that?
>
> -Neil
>
--
Mary Edie Meredith <maryedie@osdl.org>
Open Source Development Lab


Re: [GENERAL] how to get accurate values in pg_statistic

From
Rod Taylor
Date:
On Thu, 2003-09-04 at 19:50, Neil Conway wrote:
> On Thu, 2003-09-04 at 13:46, Rod Taylor wrote:
> > Run a VACUUM FULL ANALYZE between runs.  This will force a full scan of
> > all data for stats
>
> It will? Are you sure about that?

You're right. According to the docs it won't.

I had a poor stats issue on one table that was apparently solved by
that command only coincidentally.


Re: [GENERAL] how to get accurate values in pg_statistic (continued)

From
Tom Lane
Date:
Mary Edie Meredith <maryedie@osdl.org> writes:
> For our testing purposes, however, consistent results are extremely
> important. We have observed that small difference in one plan for one of
> 22 queries can cause a difference in the DBT-3 results.  If this
> happens, a small change in performance runs between two Linux kernels
> may appear to be due to the kernels, when in fact it is due to the plan
> change.

Fair enough.  If you are trying to force exactly repeatable results,
why don't you just "set seed = 0" before you ANALYZE?  There's only
one random-number generator, so that should force ANALYZE to make the
same random sampling every time.

Also, it'd be a good idea to ANALYZE the needed tables by name,
explicitly, to ensure that they are analyzed in a known order
rather than whatever order ANALYZE happens to find them in pg_class.
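
In other words, something like this (the table names are just examples;
use whatever your big DBT-3 tables are, in a fixed order):

    SET seed = 0;          -- or equivalently: SELECT setseed(0);
    ANALYZE customer;
    ANALYZE orders;
    ANALYZE lineitem;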

            regards, tom lane

Re: [GENERAL] how to get accurate values in pg_statistic

From
Mary Edie Meredith
Date:
I certainly don't claim that it is appropriate to force customers into a
full analysis, particularly if random sampling versus a full scan of the
data reveals little to no performance differences in the plans.  Being
able to sample accurately is _very_ nice for large tables.

For our testing purposes, however, consistent results are extremely
important. We have observed that a small difference in one plan for one
of the 22 queries can cause a difference in the DBT-3 results.  If this
happens, a small difference in performance between runs on two Linux
kernels may appear to be due to the kernels, when in fact it is due to
the plan change.

We know that the plans are _exactly_ the same if the data in the
pg_statistic table is the same from run to run (all other things being
equal).  So what we need is identical optimizer statistics
(pg_statistic contents) for the same table data in each run.

I feel certain that the pg_statistic table will be identical from run
to run if ANALYZE looks at every row.   Hence our hope to find a way to
force that.

We did runs overnight.  We can confirm that VACUUM FULL ANALYZE does
not produce the same pg_statistic contents run to run.  With the
default default_statistics_target (10), the plans are also different.

We ran additional tests with default_statistics_target set to 1000 (the
maximum, I believe).  The plans are the same over the different runs,
but the pg_statistic table has different values.  The performance
results of the runs are consistent (we would expect this with the same
plans).  The resulting performance metrics are similar to those of the
best plans we see using the default histogram size (good news).

However, we worry that one day the cost will change enough for whatever
reason to cause a plan change, especially for a larger database scale
factor (database size/row size).

I know we appear to be an isolated case, but customers also do testing
and may have the same consistency issues we have.  I can also imagine
cases where customers want to guarantee that plans stay the same
(between replicated sites, for example).  If two developers are
analyzing changes to the optimizer, don't you want the costs used for
testing on their two systems to be identical for comparison purposes?

Anyway, IMHO an option for an ANALYZE FULL ("sampling" all rows) would
be valuable.   Any other ideas for how to force this without a code
change are very welcome.

Thanks for your info!



On Thu, 2003-09-04 at 16:16, Tom Lane wrote:
> Mary Edie Meredith <maryedie@osdl.org> writes:
> > Stephan Szabo kindly responded to our earlier queries suggesting we look
> > at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> > STATISTICS.
>
> > These determine the number of bins in the histogram for a given column.
> > But for a large number of rows (for example 6 million) the maximum value
> > (1000) does not guarantee that ANALYZE will do a full scan of the table.
> > We do not see a way to guarantee the same statistics run to run without
> > forcing ANALYZE to examine every row of every table.
>
> Do you actually still have a problem with the plans changing when the
> stats target is above 100 or so?  I think the notion of "force ANALYZE
> to do a full scan" is inherently wrongheaded ... it certainly would not
> produce numbers that have anything to do with ordinary practice.
>
> If you have data statistics that are so bizarre that the planner still
> gets things wrong with a target of 1000, then I'd like to know more
> about why.
>
>             regards, tom lane
>
--
Mary Edie Meredith <maryedie@osdl.org>
Open Source Development Lab


Re: [GENERAL] how to get accurate values in pg_statistic

From
Bruce Momjian
Date:
Mary Edie Meredith wrote:
> I certainly don't claim that it is appropriate to force customers into a
> full analysis, particularly if random sampling versus a full scan of the
> data reveals little to no performance differences in the plans.  Being
> able to sample accurately is _very nice for large tables.
>
> For our testing purposes, however, consistent results are extremely
> important. We have observed that small difference in one plan for one of
> 22 queries can cause a difference in the DBT-3 results.  If this
> happens, a small change in performance runs between two Linux kernels
> may appear to be due to the kernels, when in fact it is due to the plan
> change.
>
> We know that the plans are _exactly the same if the data in the
> pg_statistics table is the same from run to run (all other things being
> equal).  So what we need to have is identical optimizer costs
> (pg_statistics) for the same table data for each.
>
> I feel certain that the pg_statistics table will be identical from run
> to run if analyze looks at every row.   Thus our hope to find a way to
> get that.


Actually, if you are using GEQO (many tables in a join) the optimizer
itself will randomly try plans --- even worse than random statistics.

We do have:

    #geqo_random_seed = -1          # -1 = use variable seed

that lets you force a specific random seed for testing purposes.  I
wonder if that could be extended to control VACUUM randomization too.
Right now, it just controls GEQO and in fact gets reset on every
optimizer run.

I wonder if you could just poke an srandom(10) into
src/backend/commands/analyze.c.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [GENERAL] how to get accurate values in pg_statistic

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> We do have:
>     #geqo_random_seed = -1          # -1 = use variable seed

> that lets you force a specific random seed for testing purposes.  I
> wonder if that could be extended to control VACUUM radomization too.
> Right now, it just controls GEQO and in fact gets reset on every
> optimizer run.

Actually, just the other day I was thinking we should take that out.
Since there is only one random number generator in the C library,
GEQO is messing with everyone else's state every time it decides to do
an srandom().  And there is certainly no need to do an explicit srandom
with a "random" seed every time through the optimizer, which is the
code's default behavior at the moment.  That just decreases the
randomness AFAICS, compared to letting the established sequence run.

            regards, tom lane

Re: [GENERAL] how to get accurate values in pg_statistic

From
Bruce Momjian
Date:
I have learned you can use:

    SET seed = 0;

to force identical statistics every time you run ANALYZE.

---------------------------------------------------------------------------

Mary Edie Meredith wrote:
> I certainly don't claim that it is appropriate to force customers into a
> full analysis, particularly if random sampling versus a full scan of the
> data reveals little to no performance differences in the plans.  Being
> able to sample accurately is _very nice for large tables.
>
> For our testing purposes, however, consistent results are extremely
> important. We have observed that small difference in one plan for one of
> 22 queries can cause a difference in the DBT-3 results.  If this
> happens, a small change in performance runs between two Linux kernels
> may appear to be due to the kernels, when in fact it is due to the plan
> change.
>
> We know that the plans are _exactly the same if the data in the
> pg_statistics table is the same from run to run (all other things being
> equal).  So what we need to have is identical optimizer costs
> (pg_statistics) for the same table data for each.
>
> I feel certain that the pg_statistics table will be identical from run
> to run if analyze looks at every row.   Thus our hope to find a way to
> get that.
>
> We did runs over night.  We can confirm that VACUUM FULL ANALYZE does
> not produce the same pg_statistics run to run.  With the default (10)
> default_statistics_target the plans are also different.
>
> We ran additional tests with default_statistics_target set to 1000 (the
> max I believe).  The plans are the same over the different runs, but the
> pg_statistics table has different cost values.  The performance results
> of the runs are consistent (we would expect this with the same plans).
> The resulting performance metrics are similar to the best plans we see
> using the default histogram size (good news).
>
> However, we worry that one day the cost will change enough for whatever
> reason to cause a plan change, especially for a larger database scale
> factor (database size/row size).
>
> I know we appear to be an isolated case, but customers also do testing
> and may have the same consistency issues we have.  I can also imagine
> cases where customers want to guarantee that plans stay the same
> (between replicated sites, for example).  If two developers are
> analyzing changes to the optimizer, don't you want the costs used for
> testing on their two systems to be identical for comparison purposes?
>
> Anyway, IMHO I believe that an option for an ANALYZE FULL ("sampling"
> all rows) would be valuable.   Any other ideas for how to force this
> without code change are very welcome.
>
> Thanks for your info!
>
>
>
> On Thu, 2003-09-04 at 16:16, Tom Lane wrote:
> > Mary Edie Meredith <maryedie@osdl.org> writes:
> > > Stephan Szabo kindly responded to our earlier queries suggesting we look
> > > at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> > > STATISTICS.
> >
> > > These determine the number of bins in the histogram for a given column.
> > > But for a large number of rows (for example 6 million) the maximum value
> > > (1000) does not guarantee that ANALYZE will do a full scan of the table.
> > > We do not see a way to guarantee the same statistics run to run without
> > > forcing ANALYZE to examine every row of every table.
> >
> > Do you actually still have a problem with the plans changing when the
> > stats target is above 100 or so?  I think the notion of "force ANALYZE
> > to do a full scan" is inherently wrongheaded ... it certainly would not
> > produce numbers that have anything to do with ordinary practice.
> >
> > If you have data statistics that are so bizarre that the planner still
> > gets things wrong with a target of 1000, then I'd like to know more
> > about why.
> >
> >             regards, tom lane
> >
> --
> Mary Edie Meredith <maryedie@osdl.org>
> Open Source Development Lab
>
>
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [GENERAL] how to get accurate values in pg_statistic

From
Greg Stark
Date:
Mary Edie Meredith <maryedie@osdl.org> writes:

> We ran additional tests with default_statistics_target set to 1000 (the
> max I believe).  The plans are the same over the different runs, but the
> pg_statistics table has different cost values.  The performance results
> of the runs are consistent (we would expect this with the same plans).
> The resulting performance metrics are similar to the best plans we see
> using the default histogram size (good news).

Hm, would it be possible to do a binary search and find the target at which
you start getting consistent plans? Perhaps the default of 10 is simply way
too small and should be raised?

Obviously this would depend on the data model, but I suspect your aim is
for the benchmark data to be representative of typical data models,
which scares me into thinking that users may be seeing similarly
unpredictable, variable performance.
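
One probe of such a search might look like the following (the column
and query are hypothetical stand-ins, just to illustrate checking plan
stability at the midpoint target):

    -- try the midpoint between the last unstable and stable targets, say 100
    ALTER TABLE lineitem ALTER COLUMN l_shipdate SET STATISTICS 100;
    ANALYZE lineitem;
    EXPLAIN SELECT count(*) FROM lineitem WHERE l_shipdate < date '1995-01-01';
    -- repeat the ANALYZE/EXPLAIN pair a few times; if the plan stays the
    -- same, bisect downward, otherwise bisect upward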

--
greg

Re: [GENERAL] how to get accurate values in pg_statistic

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Perhaps the default of 10 is simply way
> too small and should be raised?

I've suspected since the default existed that it might be too small ;-).
No one's yet done any experiments to try to establish a better default,
though.  I suppose the first hurdle is to find a representative dataset.

            regards, tom lane

Re: [GENERAL] how to get accurate values in pg_statistic (continued)

From
Bruce Momjian
Date:
Tom Lane wrote:
> Mary Edie Meredith <maryedie@osdl.org> writes:
> > Stephan Szabo kindly responded to our earlier queries suggesting we look
> > at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> > STATISTICS.
>
> > These determine the number of bins in the histogram for a given column.
> > But for a large number of rows (for example 6 million) the maximum value
> > (1000) does not guarantee that ANALYZE will do a full scan of the table.
> > We do not see a way to guarantee the same statistics run to run without
> > forcing ANALYZE to examine every row of every table.
>
> Do you actually still have a problem with the plans changing when the
> stats target is above 100 or so?  I think the notion of "force ANALYZE
> to do a full scan" is inherently wrongheaded ... it certainly would not
> produce numbers that have anything to do with ordinary practice.
>
> If you have data statistics that are so bizarre that the planner still
> gets things wrong with a target of 1000, then I'd like to know more
> about why.

Has there been any progress in determining if the number of default
buckets (10) is the best value?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [GENERAL] how to get accurate values in pg_statistic (continued)

From
Christopher Browne
Date:
pgman@candle.pha.pa.us (Bruce Momjian) writes:
> Tom Lane wrote:
>> Mary Edie Meredith <maryedie@osdl.org> writes:
>> > Stephan Szabo kindly responded to our earlier queries suggesting
>> > we look at default_statistics_target and ALTER TABLE ALTER COLUMN
>> > SET STATISTICS.
>>
>> > These determine the number of bins in the histogram for a given
>> > column.  But for a large number of rows (for example 6 million)
>> > the maximum value (1000) does not guarantee that ANALYZE will do
>> > a full scan of the table.  We do not see a way to guarantee the
>> > same statistics run to run without forcing ANALYZE to examine
>> > every row of every table.
>>
>> Do you actually still have a problem with the plans changing when
>> the stats target is above 100 or so?  I think the notion of "force
>> ANALYZE to do a full scan" is inherently wrongheaded ... it
>> certainly would not produce numbers that have anything to do with
>> ordinary practice.
>>
>> If you have data statistics that are so bizarre that the planner
>> still gets things wrong with a target of 1000, then I'd like to
>> know more about why.
>
> Has there been any progress in determining if the number of default
> buckets (10) is the best value?

I would think this is much more the key to the issue for their
benchmark than issues of correctly replicating the random number
generator.

I'm not clear on how data is collected into the histogram bins;
obviously it's not selecting all 6 million rows, but how many rows is
it?

The "right answer" for most use seems likely to involve:

 a) Getting an appropriate number of bins (I suspect 10 is a bit
    small, but I can't justify that mathematically), and
 b) Attaching an appropriate sample size to those bins.

What is apparently going wrong with the benchmark (and this can
doubtless arise in "real life," too) is that the random selection is
pulling too few records with the result that some of the bins are
being filled in a "skewed" manner that causes the optimizer to draw
the wrong conclusions.  (I may merely be restating the obvious here,
but if I say it a little differently than it has been said before,
someone may notice the vital "wrong assumption.")

If the samples are crummy, then perhaps:
 - There need to be more bins
 - There need to be more samples

Does the sample size change if you increase the number of bins?  If
not, then having more, smaller bins will lead to them getting
increasingly skewed if there is any accidental skew in the selection.

Do we also need a parameter to control sample size?
--
output = reverse("ofni.smrytrebil" "@" "enworbbc")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)

Re: [osdldbt-general] Re: [GENERAL] how to get accurate

From
Mary Edie Meredith
Date:
We tried 1000 as the default and found that the plans were good plans
and were consistent, but pg_statistic was not exactly the same.

We took Tom's advice and tried SET SEED=0 (actually, select setseed(0)).

We did runs last night on our project machine which produced consistent
pg_statistic data and (of course) the same plans.

We will next try runs where we vary the default buckets.  Other than 10
and 1000, what numbers would you like us to try?  Previously the number
100 was mentioned.  Are there others?

On Wed, 2003-09-10 at 12:44, Bruce Momjian wrote:
> Tom Lane wrote:
> > Mary Edie Meredith <maryedie@osdl.org> writes:
> > > Stephan Szabo kindly responded to our earlier queries suggesting we look
> > > at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> > > STATISTICS.
> >
> > > These determine the number of bins in the histogram for a given column.
> > > But for a large number of rows (for example 6 million) the maximum value
> > > (1000) does not guarantee that ANALYZE will do a full scan of the table.
> > > We do not see a way to guarantee the same statistics run to run without
> > > forcing ANALYZE to examine every row of every table.
> >
> > Do you actually still have a problem with the plans changing when the
> > stats target is above 100 or so?  I think the notion of "force ANALYZE
> > to do a full scan" is inherently wrongheaded ... it certainly would not
> > produce numbers that have anything to do with ordinary practice.
> >
> > If you have data statistics that are so bizarre that the planner still
> > gets things wrong with a target of 1000, then I'd like to know more
> > about why.
>
> Has there been any progress in determining if the number of default
> buckets (10) is the best value?
--
Mary Edie Meredith <maryedie@osdl.org>
Open Source Development Lab


Re: [osdldbt-general] Re: [GENERAL] how to get accurate

From
Christopher Browne
Date:
The world rejoiced as maryedie@osdl.org (Mary Edie Meredith) wrote:
> We tried 1000 as the default and found that the plans were good
> plans and were consistent, but the pg_statistics was not exactly the
> same.
>
> We took Tom's' advice and tried SET SEED=0 (actually select setseed
> (0) ).

When you're trying to get strict replicability of results, setting the
seed to some specific value is necessary.

Some useful results could be attained by varying the seed, and seeing
how the plans change.

> We did runs last night on our project machine which produced
> consistent pg_statistics data and (of course) the same plans.

> We will next try runs where we vary the default buckets.  Other than
> 10 and 1000, what numbers would you like us to try besides.
> Previously the number 100 was mentioned.  Are there others?

That presumably depends on what your goal is.

A useful experiment would be to see at what point (e.g., at what
bucket size) plans tend to "settle down" to the right values.

It might well be that defaulting to 23 buckets (I'm picking that out
of thin air) would cause the plans to typically be stable whatever
seed got used.

A test for this would be to, for each bucket size value, repeatedly
ANALYZE and check query plans.

At bucket size 10, you have seen the query plans vary quite a bit.

At 1000, they seem to stabilize very well.

The geometric mean of 10 and 1000 is 100, so it would surely
be useful to see if query plans are stable at that bucket size.

The most interesting number to know would be the lowest number of
buckets at which query plans are nearly always stable.  Supposing that
number was 23 (the number I earlier pulled out of the air), then that
can be used as evidence that the default value for SET STATISTICS
should be changed from 10 to 23.
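
Concretely, one round of that test could look like this (the query is
only a stand-in for one of the DBT-3 queries):

    -- at one candidate bucket size, re-sample with different seeds
    SET default_statistics_target = 100;
    SELECT setseed(0.1);
    ANALYZE lineitem;
    EXPLAIN SELECT count(*) FROM lineitem WHERE l_shipdate < date '1995-06-01';
    SELECT setseed(0.9);
    ANALYZE lineitem;
    EXPLAIN SELECT count(*) FROM lineitem WHERE l_shipdate < date '1995-06-01';
    -- if the plan matches across many seeds, the bucket size is "stable";
    -- the interesting number is the smallest size that stays stable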
--
wm(X,Y):-write(X),write('@'),write(Y). wm('aa454','freenet.carleton.ca').
http://www3.sympatico.ca/cbbrowne/nonrdbms.html
Sturgeon's Law: 90% of *EVERYTHING* is crud.

Re: [GENERAL] how to get accurate values in pg_statistic (continued)

From
Tom Lane
Date:
Christopher Browne <cbbrowne@libertyrms.info> writes:
> The "right answer" for most use seems likely to involve:
>  a) Getting an appropriate number of bins (I suspect 10 is a bit
>     small, but I can't justify that mathematically), and

I suspect that also, but I don't have real evidence for it either.
We've heard complaints from a number of people for whom it was indeed
too small ... but that doesn't prove it's not appropriate in the
majority of cases ...

> Does the sample size change if you increase the number of bins?

Yes, read the comments in backend/commands/analyze.c.

> Do we also need a parameter to control sample size?

Not if the paper I read before writing that code is correct.

            regards, tom lane

Re: [GENERAL] how to get accurate values in pg_statistic

From
"scott.marlowe"
Date:
On Thu, 11 Sep 2003, Tom Lane wrote:

> Christopher Browne <cbbrowne@libertyrms.info> writes:
> > The "right answer" for most use seems likely to involve:
> >  a) Getting an appropriate number of bins (I suspect 10 is a bit
> >     small, but I can't justify that mathematically), and
>
> I suspect that also, but I don't have real evidence for it either.
> We've heard complaints from a number of people for whom it was indeed
> too small ... but that doesn't prove it's not appropriate in the
> majority of cases ...
>
> > Does the sample size change if you increase the number of bins?
>
> Yes, read the comments in backend/commands/analyze.c.
>
> > Do we also need a parameter to control sample size?
>
> Not if the paper I read before writing that code is correct.

I was just talking to a friend of mine who does statistical analysis, and
he suggested a different way of looking at this.  I know little of
analyze.c, but I'll be reading some of it today.

His theory was that we can figure out the number of target bins by
basically running analyze twice with two different random seeds, and
initially setting the bins to 10.

Then, compare the variance of the two runs.  If the variance is great,
increase the target by X, and run the two again.  Wash, rinse, repeat
until the variance drops below some threshold.

I like the idea; I'm just not at all sure whether it's practical for
PostgreSQL to implement.
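
A rough SQL-level sketch of what Joe is describing (the column is only
an example, and a real test would compute an actual variance over the
histogram bounds rather than eyeballing them):

    -- sample the same column twice with different seeds at the current target
    SELECT setseed(0.25);
    ANALYZE lineitem (l_quantity);
    SELECT n_distinct, histogram_bounds FROM pg_stats
     WHERE tablename = 'lineitem' AND attname = 'l_quantity';

    SELECT setseed(0.75);
    ANALYZE lineitem (l_quantity);
    SELECT n_distinct, histogram_bounds FROM pg_stats
     WHERE tablename = 'lineitem' AND attname = 'l_quantity';

    -- if the two results differ a lot, raise the target by X and repeat
    ALTER TABLE lineitem ALTER COLUMN l_quantity SET STATISTICS 20;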


Re: [GENERAL] how to get accurate values in pg_statistic

From
Christopher Browne
Date:
scott.marlowe@ihs.com ("scott.marlowe") writes:
> On Thu, 11 Sep 2003, Tom Lane wrote:
>
>> Christopher Browne <cbbrowne@libertyrms.info> writes:
>> > The "right answer" for most use seems likely to involve:
>> >  a) Getting an appropriate number of bins (I suspect 10 is a bit
>> >     small, but I can't justify that mathematically), and
>>
>> I suspect that also, but I don't have real evidence for it either.
>> We've heard complaints from a number of people for whom it was indeed
>> too small ... but that doesn't prove it's not appropriate in the
>> majority of cases ...
>>
>> > Does the sample size change if you increase the number of bins?
>>
>> Yes, read the comments in backend/commands/analyze.c.
>>
>> > Do we also need a parameter to control sample size?
>>
>> Not if the paper I read before writing that code is correct.
>
> I was just talking to a friend of mine who does statistical analysis, and
> he suggested a different way of looking at this.  I know little of the
> analyze.c, but I'll be reading it some today.
>
> His theory was that we can figure out the number of target bins by
> basically running analyze twice with two different random seeds, and
> initially setting the bins to 10.
>
> The, compare the variance of the two runs.  If the variance is great,
> increase the target by X, and run two again.  repeat, wash, rinse, until
> the variance drops below some threshold.
>
> I like the idea, I'm not at all sure if it's practical for Postgresql to
> implement it.

It may suffice to do some analytic runs on some "reasonable datasets"
in order to come up with a better default than 10.

If you run this process a few times on some different databases and
find that the variance keeps dropping pretty quickly, then that would
be good material for arguing that 10 should change to 17 or 23 or 31
or some such value.  (The only interesting pattern in that is that
those are all primes :-).)
--
output = ("cbbrowne" "@" "libertyrms.info")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)

Re: [GENERAL] how to get accurate values in pg_statistic

From
"scott.marlowe"
Date:
On Thu, 11 Sep 2003, Christopher Browne wrote:

> scott.marlowe@ihs.com ("scott.marlowe") writes:
> > On Thu, 11 Sep 2003, Tom Lane wrote:
> >
> >> Christopher Browne <cbbrowne@libertyrms.info> writes:
> >> > The "right answer" for most use seems likely to involve:
> >> >  a) Getting an appropriate number of bins (I suspect 10 is a bit
> >> >     small, but I can't justify that mathematically), and
> >>
> >> I suspect that also, but I don't have real evidence for it either.
> >> We've heard complaints from a number of people for whom it was indeed
> >> too small ... but that doesn't prove it's not appropriate in the
> >> majority of cases ...
> >>
> >> > Does the sample size change if you increase the number of bins?
> >>
> >> Yes, read the comments in backend/commands/analyze.c.
> >>
> >> > Do we also need a parameter to control sample size?
> >>
> >> Not if the paper I read before writing that code is correct.
> >
> > I was just talking to a friend of mine who does statistical analysis, and
> > he suggested a different way of looking at this.  I know little of the
> > analyze.c, but I'll be reading it some today.
> >
> > His theory was that we can figure out the number of target bins by
> > basically running analyze twice with two different random seeds, and
> > initially setting the bins to 10.
> >
> > The, compare the variance of the two runs.  If the variance is great,
> > increase the target by X, and run two again.  repeat, wash, rinse, until
> > the variance drops below some threshold.
> >
> > I like the idea, I'm not at all sure if it's practical for Postgresql to
> > implement it.
>
> It may suffice to do some analytic runs on some "reasonable datasets"
> in order to come up with a better default than 10.
>
> If you run this process a few times on some different databases and
> find that the variance keeps dropping pretty quickly, then that would
> be good material for arguing that 10 should change to 17 or 23 or 31
> or some such value.  (The only interesting pttern in that is that
> those are all primes :-).)

That's a good intermediate solution, but it really doesn't solve
everyone's issue.  If one table/field has a nice even distribution (i.e.
10 rows with id 1, 10 rows with id 2, and so on) then it won't need
nearly as high a default target as a column with lots of weird spikes and
such in it.

That's why Joe (my statistics friend) made the point about iterating over
each table with higher targets until the variance drops to something
reasonable.

I would imagine a simple script would be a good proof of concept of this,
but in the long run it would be a huge win if the analyze.c code eventually
did this automagically, so that you don't end up with a target that's too
low for some complex data sets and too high for simple ones.

Well, time for me to get to work on a proof of concept...