Re: Better default_statistics_target - Mailing list pgsql-patches

From Chris Browne
Subject Re: Better default_statistics_target
Date
Msg-id 608x48yk5f.fsf@dba2.int.libertyrms.com
Whole thread Raw
In response to Re: Better default_statistics_target  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: Better default_statistics_target  (Gregory Stark <stark@enterprisedb.com>)
Re: Better default_statistics_target  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-patches
guillaume.smet@gmail.com ("Guillaume Smet") writes:
> On Dec 5, 2007 3:26 PM, Greg Sabino Mullane <greg@turnstep.com> wrote:
>> Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
>> there a reason not to make this change? I know I've been lazy and not run
>> any absolute figures, but rough tests show that raising it (from 10 to
>> 100) results in a very minor increase in analyze time, even for large
>> databases. I think the burden of a slightly slower analyze time, which
>> can be easily adjusted, both in postgresql.conf and right before running
>> an analyze, is very small compared to the pain of some queries - which worked
>> before - suddenly running much, much slower for no apparent reason at all.
>
> As Tom stated it earlier, the ANALYZE slow down is far from being the
> only consequence. The planner will also have more work to do and
> that's the hard point IMHO.
>
> Without studying the impacts of this change on a large set of queries
> in different cases, it's quite hard to know for sure that it won't
> have a negative impact in a lot of cases.
>
> It's a bit too late in the cycle to change that IMHO, especially
> without any numbers.

I have the theory (thus far not borne out by any numbers) that it
might be a useful approach to try to go through the DB schema and use
what information is there to try to come up with better numbers on a
per-column basis.

As a "first order" perspective on things:

 - Any columns marked "unique" could keep to having somewhat smaller
   numbers of bins in the histogram because we know that uniqueness
   will keep values dispersed at least somewhat.

   Ditto for "SERIAL" types.

 - Columns NOT marked unique should imply adding some bins to the
   histogram.

 - Datestamps tend to imply temporal dispersion, ergo "somewhat fewer
   bins."  Similar for floats.

 - Discrete values (integer, text) frequently see less dispersion,
   -> "more bins"

Then could come a "second order" perspective, where data would
actually get sampled from pg_statistics.

 - If we look at the number of distinct histogram bins used, for a
   particular column, and find that there are some not used, we might
   drop bins.

 - We might try doing some summary statistics to see how many unique
   values there actually are, on each column, and increase the number of
   bins if they're all in use, and there are other values that *are*
   frequently used.

   Maybe cheaper, if we find that pg_statistics tells us that all bins
   are in use, and extrapolation shows that there's a lot of the table
   NOT represented, we increase the number of bins.

There might even be a "third order" analysis, where you'd try to
collect additional data from the table, and analytically try to
determine appropriate numbers of bins...

Thus, we don't have a universal increase in the amount of statistics
collected - the added stats are localized to places where there is
some reason to imagine them useful.
--
let name="cbbrowne" and tld="acm.org" in String.concat "@" [name;tld];;
http://cbbrowne.com/info/nonrdbms.html
There was a young lady of Crewe
Whose limericks stopped at line two.

pgsql-patches by date:

Previous
From: Andrew Chernow
Date:
Subject: Re: PQParam version 0.5
Next
From: Gregory Stark
Date:
Subject: Re: Better default_statistics_target