Re: Idea for the statistics collector - Mailing list pgsql-general

From: Doug Fields
Subject: Re: Idea for the statistics collector
Msg-id: 5.1.0.14.2.20020621105749.02b99b78@pop.pexicom.com
In response to: Re: Idea for the statistics collector (Jan Wieck <JanWieck@Yahoo.com>)
List: pgsql-general
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Martijn van Oosterhout wrote:
> > >> Firstly, I was only thinking of going for the basic nodes (Index Scan,
> > >> Seq Scan, Distinct). Other types have far more variables. Secondly,
> > >> even if you only count, it's useful. For example, if it tells you that
> > >> the planner is off by a factor of 10 more than 75% of the time, that's
> > >> useful information independent of what the actual variables are.
> >
> > And reduce the usefulness even more ;-).  As a pure stats-gathering
> > exercise it might be worth doing, but not if you only log the failure
> > cases.  How will you know how well you are doing if you take a
> > biased-by-design sample?

Personally, given that at least once or twice a day someone asks about
performance or "why isn't my index being used," I think doing this would
be a great idea.

Perhaps not necessarily in a full-fledged way, but as a sort of "ANALYZE
log," which records the optimizer's row estimate and the actual result
count for every query. This, of course, could be enabled or disabled on a
per-connection basis, a per-table basis (like OIDs), or whatever other
basis makes life easiest for the developers.
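
For context, the estimate/actual pair I have in mind is exactly what
EXPLAIN ANALYZE already prints for a single query (the table, column, and
numbers below are made up for illustration):

    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

    Seq Scan on orders  (cost=0.00..1693.00 rows=9800 width=64)
                        (actual time=0.18..112.40 rows=12 loops=1)

The "ANALYZE log" would, in effect, accumulate that rows=9800 estimate
against the rows=12 actual for every query run, rather than only when
someone thinks to run EXPLAIN ANALYZE by hand.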

Then, when the next ANALYZE is run, it could do its usual analysis, and
apply some additional heuristics based upon what it learns from the
"ANALYZE log," possibly to do several things:

1) Automatically increase/decrease the per-column SET STATISTICS target
used by ANALYZE, for example, increasing it as a table grows larger while
its "randomness" grows less than linearly with size (e.g., 50 or 60
distinct groups in a 1,000,000-row table certainly warrant a higher SET
STATISTICS target; I set it by hand on my tables - see the sketch after
this list).
2) Have an additional value on the statistics table - call it
"index_heuristic" or "random_page_adjustment_heuristic" - which does
nothing when set to 1, but otherwise scales the cost of using an
index/seq scan by that factor; never change it by more than a few
percent per ANALYZE run.
3) Flag really bad query estimates in a second log (maybe the regular
log) - let it analyze the estimate errors and flag anything two or three
standard deviations out.
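
Until something like this exists, the first two knobs can be turned by
hand; a minimal sketch, using the made-up table from above
(random_page_cost is the closest existing global knob to the adjustment
factor in item 2):

    -- Item 1 by hand: raise the per-column statistics target, then
    -- re-sample the table so the planner sees the new histogram.
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
    ANALYZE orders;

    -- Item 2 by hand, roughly: lowering random_page_cost makes index
    -- scans look cheaper relative to seq scans, for the whole session.
    SET random_page_cost = 2;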

Now, I suggest all this stuff in the name of usability and
self-maintainability. Unfortunately, I don't have the wherewithal to
actually assist in development.

Another possibility is to default "enable_seqscan" to OFF, so that if
there's an index, it will be used regardless - as this seems to be what
the great majority of people expect to happen. And/or add this to a FAQ,
and let us all reply "see http://.../indexfaq.html." :)
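
For what it's worth, the per-session version of that switch works today;
a quick sketch against the same made-up table:

    -- Strongly discourage seq scans for this session only; the planner
    -- will then pick an available index even when its cost estimate
    -- says a seq scan would be cheaper.
    SET enable_seqscan = OFF;
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;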

Cheers,

Doug

