Thread: self-tuning histograms

self-tuning histograms

From: Neil Conway

What does everyone think about adding self-tuning histograms
to PostgreSQL?

Briefly, a self-tuning histogram is one that is constructed without
looking at the data in the attribute; it uses the information provided
by the query executor to adjust a default set of histograms that are
created when the table is defined. Thus, the histograms automatically
adapt to the data that is stored in the table -- as the data
distribution changes, so do the histograms.

Histogram refinement can take place in two possible ways: online
(as queries are executed, the histograms are updated immediately),
or offline (the necessary data is written to a log after every
query, which is processed on a regular basis to refine the
histograms).

The paper I've looked at on this topic is "Self-tuning
Histograms: Building Histograms Without Looking at Data", by
Aboulnaga and Chaudhuri (1999), which you can find here:
http://citeseer.nj.nec.com/255752.html -- please refer to
it for lots more information on this technique.
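
To make the refinement step concrete, here is a toy sketch of the
frequency-update rule from section 3.2. It's Python rather than
backend C, and the bucket layout, the damping factor alpha, and all
of the names are mine, not from any existing implementation:

class Bucket:
    def __init__(self, lo, hi, freq):
        self.lo = lo        # inclusive lower bound
        self.hi = hi        # exclusive upper bound
        self.freq = freq    # estimated tuple count in [lo, hi)

def estimate(hist, lo, hi):
    # Standard histogram estimate for "lo <= x < hi", assuming a
    # uniform distribution within each bucket.
    est = 0.0
    for b in hist:
        overlap = max(0.0, min(hi, b.hi) - max(lo, b.lo))
        est += b.freq * overlap / (b.hi - b.lo)
    return est

def refine(hist, lo, hi, actual, alpha=0.5):
    # Distribute the estimation error over the buckets overlapping
    # the query range, in proportion to each bucket's share of the
    # estimate. alpha damps the correction so that a single outlier
    # query can't swing the histogram too far.
    est = estimate(hist, lo, hi)
    if est <= 0:
        return
    error = actual - est
    for b in hist:
        overlap = max(0.0, min(hi, b.hi) - max(lo, b.lo))
        if overlap > 0:
            share = b.freq * overlap / (b.hi - b.lo)
            b.freq = max(b.freq + alpha * error * share / est, 0.0)

# Ten buckets over [0, 100), 1000 tuples, uniform initial assumption.
hist = [Bucket(i * 10.0, (i + 1) * 10.0, 100.0) for i in range(10)]
refine(hist, 20.0, 40.0, actual=500)    # executor observed 500 tuples
print(estimate(hist, 20.0, 40.0))       # was 200.0, now 350.0

Each round of feedback pulls the estimates for frequently-queried
ranges closer to what the executor actually sees.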

I think that ST histograms would be useful because:

(1) It would make it easier for us to implement multi-dimensional
    histograms (for more info, see the Aboulnaga and Chaudhuri
    paper). Since no commercial system currently implements them,
    I think this would be a neat thing to have.

(2) I'm unsure of the accuracy of building histograms through
    statistical sampling. My guess would be that ST histograms
    would achieve better accuracy when it matters most -- i.e.
    on those tables accessed the most often (since those are
    the tables for which the most histogram refinement is done).

(3) The need for manual DB maintenance through VACUUM and
    ANALYZE is problematic. This technique would be a step in
    the direction of removing that requirement. Self-tuning
    databases are something a lot of industry players (IBM,
    Microsoft, others) are working toward.

(4) They scale well -- refining histograms on a 100 million
    tuple table is no different than on a 100 tuple table.

There are some disadvantages, however:

(1) Reproducibility: At the moment, the system's performance
    only changes when the data is changed, or the DBA makes a
    configuration change. With this (and other "self-tuning"
    techniques, which are becoming very popular among
    commercial databases), the system can change its own state
    without the intervention of the DBA. While I'd hope that
    those changes are for the better (i.e. histograms
    eventually converging toward "perfect" accuracy), that
    won't always be the case. I don't really see a way around
    this, other than letting the DBA disable ST histograms
    when debugging problems.

(2) Performance: As Aboulnaga and Chaudhuri point out, online
    histogram refinement can become a point of contention.
    Obviously, we want to avoid that. I think online refinement
    is still possible as long as we:

        (a) don't block waiting for locks: try to acquire the
            necessary locks to refine the histograms,
            immediately give up if not possible

        (b) delay histogram refinement so it doesn't interfere
            with the user: for example, store histogram data
            locally and only update the system catalogs when
            the backend is idle

        (c) only update the histogram when major changes can
            be applied: skip trivial refinements (or store
            those in the offline log for later processing);
            a rough sketch of (a) through (c) follows below

        (d) allow the DBA to choose between offline and online
            histogram refinement (assuming we choose to
            implement both)
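
Here is that sketch -- again in Python, just to show the shape of
the idea. The lock, the significance threshold, and the
write_to_catalog() hook are all hypothetical stand-ins, not
existing PostgreSQL interfaces:

import threading

catalog_lock = threading.Lock()     # stand-in for a pg_statistic lock
pending = []                        # refinements queued by this backend
SIGNIFICANCE = 0.10                 # skip changes smaller than 10%

def write_to_catalog(bucket_id, freq):
    # Hypothetical hook: a real patch would update pg_statistic
    # through the usual MVCC-aware catalog machinery.
    print("bucket %d -> %.1f" % (bucket_id, freq))

def note_refinement(bucket_id, old_freq, new_freq):
    # (c) only queue refinements that change the estimate materially;
    # trivial ones could instead go to the offline log.
    if old_freq > 0 and abs(new_freq - old_freq) / old_freq < SIGNIFICANCE:
        return
    pending.append((bucket_id, new_freq))

def flush_if_idle():
    # (b) called when the backend goes idle, never in the query path.
    if not pending:
        return
    # (a) try for the lock, but never block waiting on it.
    if not catalog_lock.acquire(False):
        return                      # busy; try again at the next idle point
    try:
        for bucket_id, freq in pending:
            write_to_catalog(bucket_id, freq)
        del pending[:]
    finally:
        catalog_lock.release()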

Any comments?

Cheers,

Neil

-- 
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC


Re: self-tuning histograms

From: Gavin Sherry

Neil,

I've also been thinking about this but haven't had time to collect my
thoughts.

On Wed, 29 May 2002, Neil Conway wrote:

> Histogram refinement can take place in two possible ways: online
> (as queries are executed, the histograms are updated immediately),
> or offline (the necessary data is written to a log after every
> query, which is processed on a regular basis to refine the
> histograms).

I would have thought that offline would have meant that the histogram
refinement could be run at the DBA's leisure.

> There are some disadvantages, however:
> 
> (1) Reproducibility: At the moment, the system's performance
>     only changes when the data is changed, or the DBA makes a
>     configuration change. With this (and other "self-tuning"
>     techniques, which are becoming very popular among
>     commercial databases), the system can change its own state
>     without the intervention of the DBA. While I'd
>     hope that those changes are for the better (i.e. histograms
>     eventually converging toward "perfect" accuracy), that
>     won't always be the case. I don't really see a way around
>     this, other than letting the DBA disable ST histograms
>     when debugging problems.

Self-tuning would have to be optional.

> (2) Performance: As Aboulnaga and Chaudhuri point out, online
>     histogram refinement can become a point of contention.
>     Obviously, we want to avoid that. I think online refinement
>     is still possible as long as we:
> 
>         (a) don't block waiting for locks: try to acquire the
>             necessary locks to refine the histograms,
>             immediately give up if not possible
> 
>         (b) delay histogram refinement so it doesn't interfere
>             with the user: for example, store histogram data
>             locally and only update the system catalogs when
>             the backend is idle

This should be fine as long as the refinement system works through MVCC.

There is another consideration. If a database is using histogram
refinement then the 'base' data it works on must be accurate. If not,
refinement would compound the inaccuracy of the histogram. As such,
ANALYZE would have to scan the whole table (if/when run), COPY would have
to update the statistics, etc.

Gavin




Re: self-tuning histograms

From: Neil Conway

On Thu, 30 May 2002 13:52:08 +1000 (EST)
"Gavin Sherry" <swm@linuxworld.com.au> wrote:
> On Wed, 29 May 2002, Neil Conway wrote:
> > Histogram refinement can take place in two possible ways: online
> > (as queries are executed, the histograms are updated immediately),
> > or offline (the necessary data is written to a log after every
> > query, which is processed on a regular basis to refine the
> > histograms).
> 
> I would have thought that offline would have meant that the histogram
> refinement could be run at the DBA's leisure.

Yeah -- that makes more sense.

> > (2) Performance: As Aboulnaga and Chaudhuri point out, online
> >     histogram refinement can become a point of contention.
> >     Obviously, we want to avoid that.
> 
> This should be fine as long as the refinement system works through MVCC.

Good point -- the current pg_statistic routines are MVCC aware, so
there's no reason to change that. In that case, my concerns about
contention over histogram refinement may be unfounded. I still think we
should avoid redundant histogram refinements (i.e. don't update the
histograms on every single query), but I'm glad that MVCC solves most
of this problem for us.

> There is another consideration. If a database is using histogram
> refinement then the 'base' data it works on must be accurate. If not,
> refinement would compound the inaccuracy of the histogram.

Aboulnaga and Chaudhuri's technique doesn't require this. The
initial assumption
is that the attribute is uniformly distributed over the initial
buckets of the histogram. Naturally, this is incorrect: as queries
are executed, the initial histogram is modified by "refinement"
(the frequencies of individual buckets are adjusted) and
"restructuring" (bucket boundaries are adjusted). For more
information (and the exact algorithms used), see sections 3.2
and 3.3 of the paper.
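
To go with the frequency-refinement sketch in my first message (same
toy Bucket class), restructuring could look roughly like this. It's
a deliberately simplified rendering of section 3.3 with a made-up
merge threshold, not the paper's exact algorithm:

def restructure(hist, merge_threshold=0.01):
    # Merge runs of adjacent buckets with nearly equal frequencies
    # (within merge_threshold of the total tuple count) -- they
    # carry little information, so their buckets can be reclaimed.
    total = sum(b.freq for b in hist)
    merged = [hist[0]]
    for b in hist[1:]:
        prev = merged[-1]
        if abs(prev.freq - b.freq) <= merge_threshold * total:
            prev.hi = b.hi              # widen the previous bucket
            prev.freq += b.freq
        else:
            merged.append(b)
    # Spend the freed buckets on high-frequency buckets: split each
    # in half, sharing the frequency equally (the uniform assumption
    # again). The paper splits the top s% of buckets; repeatedly
    # splitting the single largest one is a simplification.
    freed = len(hist) - len(merged)
    for _ in range(freed):
        i = max(range(len(merged)), key=lambda j: merged[j].freq)
        v = merged[i]
        mid = (v.lo + v.hi) / 2.0
        merged[i:i+1] = [Bucket(v.lo, mid, v.freq / 2.0),
                         Bucket(mid, v.hi, v.freq / 2.0)]
    return merged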

Cheers,

Neil

-- 
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC


Re: self-tuning histograms

From: Tom Lane

Neil Conway <nconway@klamath.dyndns.org> writes:
> What does everyone think about adding self-tuning histograms
> to PostgreSQL?
> [ snip ]
> I think that ST histograms would be useful because:

> (1) It would make it easier for us to implement multi-dimensional
>     histograms (for more info, see the Aboulnaga and Chaudhuri paper).

This seems potentially useful, although I think the paper seriously
understates the difficulty of drawing meaningful deductions from real
queries.  A complex query is likely to contain other constraints besides
the ones relevant to a particular histogram, which will make it
difficult to extract the needed selectivity data --- the final tuple
count certainly isn't what you need to know.  Internal instrumentation
(a la EXPLAIN ANALYZE) might give you the right numbers, but it depends
a lot on what the plan is.

An example: one of the main things you'd like multidimensional
histograms for is to estimate join selectivities more accurately (this
requires cross-table histograms, obviously).  But in any join plan,
you are going to push down any available single-table restriction
clauses to the individual scan subplans, whereupon counting the join
plan's output tuples will *not* give you an unskewed estimate of the
overall distribution of the joined variables.
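
To make that concrete, here's a toy example with invented numbers
(Python used only to do the counting). The attribute r.x is actually
uniform in the join result, but y is correlated with x, so a
pushed-down "r.y = 1" restriction makes the observed join output
look heavily skewed:

# 10 rows with x = 1 (one of which has y = 1), 10 rows with x = 2
# (nine of which have y = 1); s has five matches for each x value.
r = [(1, 1)] * 1 + [(1, 2)] * 9 + [(2, 1)] * 9 + [(2, 2)] * 1
s = [1] * 5 + [2] * 5

def join_counts(r_rows):
    counts = {}
    for (x, y) in r_rows:
        counts[x] = counts.get(x, 0) + sum(1 for sx in s if sx == x)
    return counts

print(join_counts(r))                            # true: {1: 50, 2: 50}
print(join_counts([t for t in r if t[1] == 1]))  # observed: {1: 5, 2: 45}

Feeding the second set of counts back into a histogram on x would
teach it a distribution the full join simply doesn't have.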

> (2) I'm unsure of the accuracy of building histograms through
>     statistical sampling. My guess would be that ST histograms
>     would achieve better accuracy when it matters most -- i.e.

I think not.  The paper says that ST histograms are at best in the same
league as traditional histograms, and in cases of high skew much worse.
Unfortunately, high skew is exactly where you *need* a histogram; with
low-skew data you can get away with assuming uniform distribution.  So
I thought they were being a bit overoptimistic about the usefulness of
the technique.

> (3) The need for manual DB maintenance through VACUUM and
>     ANALYZE is problematic. This technique would be a step in
>     the direction of removing that requirement. Self-tuning
>     databases are something a lot of industry players (IBM,
>     Microsoft, others) are working toward.

"Self tuning" does not equate to "get rid of VACUUM and ANALYZE" in my
view.  I'd prefer to see those maintenance processes scheduled
automatically, but that doesn't mean we don't need them.


I think it'd probably be premature to think about self-tuning histograms
as such.  They look useful for multivariable histograms, and for
estimating queries involving remote data sources, but we are nowhere
near being able to make use of such histograms if we had them.  I'd
counsel working first on the planner to see how we could make use of
multivariable histograms built using a more traditional method.  If that
flies, it'd be time enough to look at ST methods for collecting the
histograms.
        regards, tom lane