Re: self-tuning histograms - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: self-tuning histograms |
Date | |
Msg-id | 1895.1022780004@sss.pgh.pa.us Whole thread Raw |
In response to | self-tuning histograms (Neil Conway <nconway@klamath.dyndns.org>) |
List | pgsql-hackers |
Neil Conway <nconway@klamath.dyndns.org> writes: > What does everyone think about adding self-tuning histograms > to PostgreSQL? > [ snip ] > I think that ST histograms would be useful because: > (1) It would make it easier for us to implement multi-dimensional > histograms (for more info, see the Aboulnaga and Shaudhuri). This seems potentially useful, although I think the paper seriously understates the difficulty of drawing meaningful deductions from real queries. A complex query is likely to contain other constraints besides the ones relevant to a particular histogram, which will make it difficult to extract the needed selectivity data --- the final tuple count certainly isn't what you need to know. Internal instrumentation (a la EXPLAIN ANALYZE) might give you the right numbers, but it depends a lot on what the plan is. An example: one of the main things you'd like multidimensional histograms for is to estimate join selectivities more accurately (this requires cross-table histograms, obviously). But in any join plan, you are going to push down any available single-table restriction clauses to the individual scan subplans, whereupon counting the join plan's output tuples will *not* give you an unskewed estimate of the overall distribution of the joined variables. > (2) I'm unsure of the accuracy of building histograms through > statistical sampling. My guess would be that ST histograms > would achieve better accuracy when it matters most -- i.e. I think not. The paper says that ST histograms are at best in the same league as traditional histograms, and in cases of high skew much worse. Unfortunately, high skew is exactly where you *need* a histogram; with low-skew data you can get away with assuming uniform distribution. So I thought they were being a bit overoptimistic about the usefulness of the technique. > (3) The need for manual DB maintainence through VACUUM and > ANALYZE is problematic. This technique would be a step in > the direction of removing that requirement. Self-tuning > databases are something a lot of industry players (IBM, > Microsoft, others) are working toward. "Self tuning" does not equate to "get rid of VACUUM and ANALYZE" in my view. I'd prefer to see those maintenance processes scheduled automatically, but that doesn't mean we don't need them. I think it'd probably be premature to think about self-tuning histograms as such. They look useful for multivariable histograms, and for estimating queries involving remote data sources, but we are nowhere near being able to make use of such histograms if we had them. I'd counsel working first on the planner to see how we could make use of multivariable histograms built using a more traditional method. If that flies, it'd be time enough to look at ST methods for collecting the histograms. regards, tom lane
pgsql-hackers by date: