Re: self-tuning histograms - Mailing list pgsql-hackers

From Tom Lane
Subject Re: self-tuning histograms
Date
Msg-id 1895.1022780004@sss.pgh.pa.us
Whole thread Raw
In response to self-tuning histograms  (Neil Conway <nconway@klamath.dyndns.org>)
List pgsql-hackers
Neil Conway <nconway@klamath.dyndns.org> writes:
> What does everyone think about adding self-tuning histograms
> to PostgreSQL?
> [ snip ]
> I think that ST histograms would be useful because:

> (1) It would make it easier for us to implement multi-dimensional
>     histograms (for more info, see the Aboulnaga and Shaudhuri).

This seems potentially useful, although I think the paper seriously
understates the difficulty of drawing meaningful deductions from real
queries.  A complex query is likely to contain other constraints besides
the ones relevant to a particular histogram, which will make it
difficult to extract the needed selectivity data --- the final tuple
count certainly isn't what you need to know.  Internal instrumentation
(a la EXPLAIN ANALYZE) might give you the right numbers, but it depends
a lot on what the plan is.

An example: one of the main things you'd like multidimensional
histograms for is to estimate join selectivities more accurately (this
requires cross-table histograms, obviously).  But in any join plan,
you are going to push down any available single-table restriction
clauses to the individual scan subplans, whereupon counting the join
plan's output tuples will *not* give you an unskewed estimate of the
overall distribution of the joined variables.

> (2) I'm unsure of the accuracy of building histograms through
>     statistical sampling. My guess would be that ST histograms
>     would achieve better accuracy when it matters most -- i.e.

I think not.  The paper says that ST histograms are at best in the same
league as traditional histograms, and in cases of high skew much worse.
Unfortunately, high skew is exactly where you *need* a histogram; with
low-skew data you can get away with assuming uniform distribution.  So
I thought they were being a bit overoptimistic about the usefulness of
the technique.

> (3) The need for manual DB maintainence through VACUUM and
>     ANALYZE is problematic. This technique would be a step in
>     the direction of removing that requirement. Self-tuning
>     databases are something a lot of industry players (IBM,
>     Microsoft, others) are working toward.

"Self tuning" does not equate to "get rid of VACUUM and ANALYZE" in my
view.  I'd prefer to see those maintenance processes scheduled
automatically, but that doesn't mean we don't need them.


I think it'd probably be premature to think about self-tuning histograms
as such.  They look useful for multivariable histograms, and for
estimating queries involving remote data sources, but we are nowhere
near being able to make use of such histograms if we had them.  I'd
counsel working first on the planner to see how we could make use of
multivariable histograms built using a more traditional method.  If that
flies, it'd be time enough to look at ST methods for collecting the
histograms.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: wierd AND condition evaluation for plpgsql
Next
From: Josh Burdick
Date:
Subject: finding medians