Thread: Idea for the statistics collector

Idea for the statistics collector

From: Martijn van Oosterhout
Since it currently exists only to collect statistics on tables, why can't it
collect other kinds of statistics, like:

- How often the estimator gets it wrong?

At the end of an index scan, the executor could compare the number of rows
returned against what was estimated, and if it falls outside a certain
range, flag it.

Also, the average ratio of rows coming out of a distinct node vs the number
going in.

For a join clause, the amount of correlation between two columns (hard).

etc

Ideally, the planner could then use this info to make better plans.
Eventually, the whole system could become somewhat self-tuning.
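
To make the "flag it" check concrete, here is a minimal standalone sketch in
C. The function name, the 10x threshold, and the standalone framing are all
made up for illustration; this is not executor code.

/* Hypothetical sketch only: compare the planner's row estimate for a scan
 * against the rows actually returned, and flag it when the two differ by
 * more than some factor (10x here, chosen arbitrarily). */
#include <stdbool.h>
#include <stdio.h>

static bool
estimate_out_of_range(double est_rows, double actual_rows, double max_factor)
{
    /* Clamp to 1 so "expected nothing, got plenty" still registers as a miss
     * without dividing by zero. */
    if (est_rows < 1.0)
        est_rows = 1.0;
    if (actual_rows < 1.0)
        actual_rows = 1.0;

    double ratio = actual_rows / est_rows;
    return ratio > max_factor || ratio < 1.0 / max_factor;
}

int
main(void)
{
    /* Example: the planner expected 50 rows, the index scan returned 4200. */
    if (estimate_out_of_range(50.0, 4200.0, 10.0))
        printf("estimate off by more than 10x -- report it to the collector\n");
    return 0;
}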

Does anyone see any problems with this?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Idea for the statistics collector

From: Tom Lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> Since it's currently all for collecting statistics on tables, why can't it
> collect another type of statistic, like:
> - How often the estimator gets it wrong?
> [snip]
> Does anyone see any problems with this?

(1) forced overhead on *every* query.
(2) contention to update the same rows of pg_statistic (or wherever you
    plan to store this info).
(3) okay, so the estimate was wrong; exactly which of the many
    parameters that went into the estimate do you plan to twiddle?
    What if it's not the parameter values that are at fault, but the
    cost-model equations themselves?

Closed-loop feedback is a great thing when you understand the dynamics
of the system you intend to apply feedback control to.  When you don't,
it's a great way to shoot yourself in the foot.  Unfortunately I don't
think the PG optimizer falls in the first category at present.

            regards, tom lane

Re: Idea for the statistics collector

From: Martijn van Oosterhout
On Fri, Jun 21, 2002 at 12:47:18AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Since it's currently all for collecting statistics on tables, why can't it
> > collect another type of statistic, like:
> > - How often the estimator gets it wrong?
> > [snip]
> > Does anyone see any problems with this?
>
> (1) forced overhead on *every* query.
If you don't want it, don't use it. The current statistics have the same
issue, and you can choose not to collect those as well.

> (2) contention to update the same rows of pg_statistic (or wherever you
>     plan to store this info).

True, that can't be avoided. It depends on how many queries you run, though.
Maybe only enable it for specific sessions?

> (3) okay, so the estimate was wrong; exactly which of the many
>     parameters that went into the estimate do you plan to twiddle?
>     What if it's not the parameter values that are at fault, but the
>     cost-model equations themselves?

Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
Scan, Distinct). Other types have far more variables. Secondly, even if you
only count, it's useful. For example, if it tells you that the planner is
off by a factor of 10 more than 75% of the time, that's useful information
independent of what the actual variables are.
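
A hedged sketch of the pure counting idea follows; nothing here is collector
code, and the bucket layout is just one way to do it. The point is that a
claim like "off by 10x or more in 75% of queries" can be read straight off a
small histogram of error factors.

/* Sketch: bucket each query's estimation error by order of magnitude. */
#include <math.h>
#include <stdio.h>

#define ERROR_BUCKETS 6   /* 1x..10x, 10x..100x, ... */

static long error_histogram[ERROR_BUCKETS];

static void
record_estimate(double est_rows, double actual_rows)
{
    if (est_rows < 1.0)
        est_rows = 1.0;
    if (actual_rows < 1.0)
        actual_rows = 1.0;

    /* Symmetric error factor: 10x too high counts the same as 10x too low. */
    double factor = est_rows > actual_rows ? est_rows / actual_rows
                                           : actual_rows / est_rows;
    int bucket = (int) floor(log10(factor));
    if (bucket >= ERROR_BUCKETS)
        bucket = ERROR_BUCKETS - 1;
    error_histogram[bucket]++;
}

int
main(void)
{
    record_estimate(100.0, 90.0);     /* within 10x   */
    record_estimate(100.0, 2500.0);   /* 25x off      */
    record_estimate(5.0, 80000.0);    /* 16000x off   */

    for (int i = 0; i < ERROR_BUCKETS; i++)
        printf("off by 10^%d..10^%d: %ld queries\n", i, i + 1,
               error_histogram[i]);
    return 0;
}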

> Closed-loop feedback is a great thing when you understand the dynamics
> of the system you intend to apply feedback control to.  When you don't,
> it's a great way to shoot yourself in the foot.  Unfortunately I don't
> think the PG optimizer falls in the first category at present.

Using the results for planning is obviously a tricky area and should be
approached with caution. But just collecting the statistics shouldn't be too bad?

See also -hackers.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Idea for the statistics collector

From: Bruce Momjian
Martijn van Oosterhout wrote:
> Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
> Scan, Distinct). Other types have far more variables. Secondly, even if you
> only count, it's useful. For example, if it tells you that the planner is
> off by a factor of 10 more than 75% of the time, that's useful information
> independent of what the actual variables are.

Yes, only updating the stats if the estimate was off by a factor of 10
or so should cut down on the overhead.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Idea for the statistics collector

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Martijn van Oosterhout wrote:
>> Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
>> Scan, Distinct). Other types have far more variables. Secondly, even if you
>> only count, it's useful. For example, if it tells you that the planner is
>> off by a factor of 10 more than 75% of the time, that's useful information
>> independent of what the actual variables are.

> Yes, only updating the stats if the estimate was off by a factor of 10
> or so should cut down on the overhead.

And reduce the usefulness even more ;-).  As a pure stats-gathering
exercise it might be worth doing, but not if you only log the failure
cases.  How will you know how well you are doing if you take a
biased-by-design sample?

            regards, tom lane

Re: Idea for the statistics collector

From: Jan Wieck
Tom Lane wrote:
>
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Martijn van Oosterhout wrote:
> >> Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
> >> Scan, Distinct). Other types have far more variables. Secondly, even if you
> >> only count, it's useful. For example, if it tells you that the planner is
> >> off by a factor of 10 more than 75% of the time, that's useful information
> >> independent of what the actual variables are.
>
> > Yes, only updating the stats if the estimate was off by a factor of 10
> > or so should cut down on the overhead.
>
> And reduce the usefulness even more ;-).  As a pure stats-gathering
> exercise it might be worth doing, but not if you only log the failure
> cases.  How will you know how well you are doing if you take a
> biased-by-design sample?

Sure, it is required to count all cases, success and failure. But I don't
see why that information needs to be constantly fed back into the
statistics tables. As long as we don't restart, it's perfectly fine sitting
in the collector. And it doesn't have to be fed back to the backend on every
query.

Maybe ANALYZE would like to have some of that information? If memory
serves, ANALYZE does a poor job when the data isn't well distributed,
has few distinct values, and the like. That then causes wrong estimates
(among other things, of course). The idea would be to have ANALYZE take
a much closer look at tables with horrible estimates, to generate better
information for those.
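
As a purely hypothetical sketch of that idea: ANALYZE could consult
per-table miss counters and raise the statistics target where past estimates
were bad. The struct, the 50% cutoff, and the doubling rule below are all
invented for illustration; the collector exports nothing like this.

/* Sketch: bump the per-table sampling resolution when most recent
 * estimates for that table were badly off. */
#include <stdio.h>

typedef struct TableErrorStats
{
    const char *relname;
    long        queries_seen;
    long        queries_badly_estimated;   /* e.g. off by more than 10x */
    int         stats_target;              /* like the SET STATISTICS knob */
} TableErrorStats;

static void
tune_stats_target(TableErrorStats *t)
{
    if (t->queries_seen == 0)
        return;

    double miss_rate = (double) t->queries_badly_estimated / t->queries_seen;

    /* If most estimates were poor, double the sampling resolution,
     * up to a ceiling; never change it drastically in one pass. */
    if (miss_rate > 0.5 && t->stats_target < 1000)
        t->stats_target *= 2;
}

int
main(void)
{
    TableErrorStats t = { "orders", 400, 310, 10 };
    tune_stats_target(&t);
    printf("%s: new stats target %d\n", t.relname, t.stats_target);
    return 0;
}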


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

Re: Idea for the statistics collector

From: Doug Fields
>Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Martijn van Oosterhout wrote:
> > >> Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
> > >> Scan, Distinct). Other types have far more variables. Secondly, even if you
> > >> only count, it's useful. For example, if it tells you that the planner is
> > >> off by a factor of 10 more than 75% of the time, that's useful information
> > >> independent of what the actual variables are.
> >
> > And reduce the usefulness even more ;-).  As a pure stats-gathering
> > exercise it might be worth doing, but not if you only log the failure
> > cases.  How will you know how well you are doing if you take a
> > biased-by-design sample?

Personally, given that at least once or twice a day someone seems to ask
about performance or "why isn't my index being used?" and similar things,
I think doing this would be a great idea.

Perhaps not necessarily in the full-fledged way, but by creating a sort of
"ANALYZE log," wherein it logs the optimizer's estimate for a query and the
actual results of that query, for every query. This, of course, could be
enabled or disabled on a per-connection basis, a per-table basis (like
OIDs), or whatever other basis makes life easiest for the developers.

Then, when the next ANALYZE is run, it could do its usual analysis and
apply some additional heuristics based upon what it learns from the
"ANALYZE log," possibly to do several things:

1) Automatically increase/decrease the SET STATISTICS information included
in the analyze, for example, increasing it as a table grows larger and the
"randomness" grows less than linearly with size (e.g., if you have 50 or 60
groups in a 1,000,000-row table, that certainly needs a higher SET
STATISTICS, and I do it on my tables).
2) Have an additional value on the statistics table called the
"index_heuristic" or "random_page_adjustment_heuristic" which, when 1, does
nothing, but otherwise modifies the cost of using an index/seq scan by that
factor - and never change this by more than a few percent each ANALYZE
(see the sketch after this list).
3) Flag really bad query estimates in a second log (maybe the regular log)
- let it do an analysis of the queries and flag anything two or three
standard deviations out.
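
Here is a small sketch of the clamped adjustment in (2). The 3% cap, the
function, and the standalone framing are assumptions for illustration, not
anything in the planner.

/* Sketch: nudge a per-table cost-adjustment factor toward what feedback
 * suggests, but never move it more than a few percent per ANALYZE. */
#include <stdio.h>

#define MAX_STEP 0.03   /* never move more than 3% per ANALYZE run */

static double
nudge_adjustment(double current, double suggested)
{
    double upper = current * (1.0 + MAX_STEP);
    double lower = current * (1.0 - MAX_STEP);

    if (suggested > upper)
        return upper;
    if (suggested < lower)
        return lower;
    return suggested;
}

int
main(void)
{
    /* Feedback says index scans on this table are ~2x cheaper than costed,
     * but we creep toward that figure a few percent at a time. */
    double adj = 1.0;
    for (int run = 1; run <= 5; run++)
    {
        adj = nudge_adjustment(adj, 0.5);
        printf("after ANALYZE #%d: adjustment = %.3f\n", run, adj);
    }
    return 0;
}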

Now, I suggest all this stuff in the name of usability and
self-maintainability. Unfortunately, I don't have the wherewithal to
actually assist in development.

Another possibility is to default "use_seq_scan" (or whatever the parameter
is called - I did my optimizing a while ago, so it's fading) to OFF, so that
if there's an index, it will be used regardless, as this seems to be what
the great majority of people expect to happen. And/or add this to a FAQ,
and let us all reply "see http://.../indexfaq.html." :)

Cheers,

Doug


Re: Idea for the statistics collector

From: Bruce Momjian
Added to TODO list:

    * Log queries where the optimizer row estimates were dramatically
      different from the number of rows actually found (?)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073