Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers

From: Simon Riggs
Subject: Re: Improving N-Distinct estimation by ANALYZE
Msg-id: 1137179909.3180.91.camel@localhost.localdomain
In response to: Re: Improving N-Distinct estimation by ANALYZE (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Improving N-Distinct estimation by ANALYZE; Re: Improving N-Distinct estimation by ANALYZE
List: pgsql-hackers
On Fri, 2006-01-06 at 15:26 -0800, Josh Berkus wrote:
> Anyway, since the proof is in the pudding, Simon and I will be working on
> some demo code for different sampling methods so that we can debate
> results rather than theory.

I enclose a patch for checking out block sampling. This is not production-ready yet, but works bug-free on cvstip. Code comments have been fully updated to explain what's going on inside.

All you need to do is "set analyze_blocks=b" and ANALYZE will switch over to the block sampling method, reading all the rows in "b" blocks. The sample size will also be limited by maintenance_work_mem. (The memory limitation could be smarter.) This decouples the specification of the sample size from the specification of the MCV/histogram size (stats_target). [Right now, I'm not suggesting that we have a GUC named this - it just exists for testing. If/when we agree to allow block sampling, then we can discuss how to invoke/specify it.]

The stats calculations aren't touched - it still uses Haas-Stokes. If you "set log_min_messages=DEBUG2" you'll also get more useful explanations of what the variables are and what decisions it makes about D for each column being analyzed.

This patch has two main effects:
- ANALYZE runs more than 10x faster to retrieve the same size sample
- you can specify much larger samples for bigger tables, without increasing the stats targets

Generally, the larger samples will give better results for the estimation. However, what is immediately apparent is that the Haas-Stokes estimator actually gets even worse with block sampling in the particular case I raised on-list. (Which is understandable, but not desirable.) ISTM this is a strike against Haas-Stokes, rather than a strike against block sampling.

So I'm looking at implementing the Brutlag & Richardson estimator(s), which cope with the number of values appearing in only one block. Not surprisingly, that means some non-trivial additional code to retrieve block ids for each tuple and make decisions accordingly. I plan to use a technique similar to the existing TupnoLink array to match block ids.

The B&R estimators should allow a fairly fast, small sample to be taken, making this more usable for dynamic sampling during query planning (when desirable; see the recent -perform thread).

It's also worth mentioning that for datatypes that have only an "=" operator, the performance of compute_minimal_stats is O(N^2) when values are unique, so increasing the sample size is a very bad idea in that case. It may be possible to re-sample the sample, so that we get only one row per block as with the current row sampling method. Another idea might be just to abort the analysis when the column looks fairly unique, rather than churn through the whole sample.

Rough sketches of these ideas are appended below.
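To make the block sampling mechanism concrete, here is a minimal standalone sketch of the idea - hypothetical names and a fixed rows-per-block, nothing taken from the actual patch. The point is that I/O is proportional to b, yet every row in each visited block contributes to the sample. Invocation would just be e.g. "set analyze_blocks=30" followed by an ordinary ANALYZE.

/*
 * Standalone sketch of block sampling (not the patch code; all names
 * hypothetical).  Instead of visiting ~n random blocks to pull one
 * row from each, pick "b" random blocks and take every row they
 * contain.
 */
#include <stdio.h>
#include <stdlib.h>

#define ROWS_PER_BLOCK 100      /* assume a fixed fill, for simplicity */

int
main(void)
{
    int     nblocks = 10000;    /* blocks in the "table" */
    int     b = 30;             /* the analyze_blocks setting */
    char   *picked = calloc(nblocks, 1);
    int     sampled = 0;

    srandom(42);                /* deterministic, for repeatability */

    /* choose b distinct blocks uniformly at random */
    for (int chosen = 0; chosen < b;)
    {
        int     blk = random() % nblocks;

        if (!picked[blk])
        {
            picked[blk] = 1;
            chosen++;
        }
    }

    /* "read" every row in each chosen block */
    for (int blk = 0; blk < nblocks; blk++)
    {
        if (picked[blk])
            sampled += ROWS_PER_BLOCK;  /* real code would copy tuples */
    }

    printf("sampled %d rows from %d of %d blocks\n", sampled, b, nblocks);
    free(picked);
    return 0;
}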
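For reference, the Haas-Stokes calculation that ANALYZE applies is approximately the following - a paraphrase of the formula in compute_scalar_stats, not the exact backend code:

static double
haas_stokes_D(double n, double N, double d, double f1)
{
    /*
     * Haas-Stokes smoothed jackknife: n = rows sampled, N = rows in
     * the table, d = distinct values seen in the sample, f1 = values
     * seen exactly once.  When f1 approaches d (everything looks
     * unique) the estimate is pushed up towards N; when f1 is near 0
     * it collapses towards d.
     */
    return (n * d) / ((n - f1) + f1 * n / N);
}

The estimate leans entirely on f1, and block sampling changes the distribution of f1 whenever values cluster within blocks - which is presumably why the estimate degrades in the clustered case above.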
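The extra bookkeeping the B&R estimators need looks roughly like this - a standalone sketch with made-up names that sorts (value, blockid) pairs; the real code would work off a TupnoLink-style array rather than sorting, and this shows only the per-value block counting, not the estimator itself:

/*
 * For each sampled value, count how many distinct blocks it appears
 * in, and in particular how many values are confined to one block.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int     value;
    int     blockid;
} SampleTuple;

static int
cmp_tuple(const void *a, const void *b)
{
    const SampleTuple *ta = a;
    const SampleTuple *tb = b;

    if (ta->value != tb->value)
        return (ta->value > tb->value) - (ta->value < tb->value);
    return (ta->blockid > tb->blockid) - (ta->blockid < tb->blockid);
}

int
main(void)
{
    SampleTuple sample[] = {
        {7, 1}, {7, 1}, {7, 2},     /* value 7 spans two blocks */
        {9, 3}, {9, 3},             /* value 9 confined to one block */
        {4, 2}
    };
    int     n = sizeof(sample) / sizeof(sample[0]);
    int     d = 0;              /* distinct values */
    int     one_block = 0;      /* values seen in exactly one block */

    qsort(sample, n, sizeof(SampleTuple), cmp_tuple);

    for (int i = 0; i < n;)
    {
        int     j = i;
        int     nblocks = 0;

        /* walk one value's run, counting distinct block ids */
        while (j < n && sample[j].value == sample[i].value)
        {
            if (nblocks == 0 || sample[j].blockid != sample[j - 1].blockid)
                nblocks++;
            j++;
        }
        d++;
        if (nblocks == 1)
            one_block++;
        i = j;
    }

    printf("d = %d, confined to one block = %d\n", d, one_block);
    return 0;
}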
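To see where the O(N^2) comes from in the "="-only case, here is the shape of the loop - a simplified sketch, not the actual compute_minimal_stats. With no ordering operator there is no way to sort and deduplicate, so each sample value must be compared against every value tracked so far; when the column is unique, the tracked list grows by one per row and the compare count is ~n^2/2. Re-sampling down to one row per block would cap n and so bound this cost.

#include <stdio.h>

static int
count_distinct_equality_only(const int *sample, int n, int *track)
{
    int     ntracked = 0;

    for (int i = 0; i < n; i++)
    {
        int     j;

        /* linear search: the only operator available is "=" */
        for (j = 0; j < ntracked; j++)
        {
            if (track[j] == sample[i])
                break;          /* match: real code bumps a counter */
        }
        if (j == ntracked)
            track[ntracked++] = sample[i];      /* unseen: track it */
    }
    return ntracked;
}

int
main(void)
{
    int     sample[] = {5, 3, 5, 8, 1};
    int     track[5];

    printf("%d distinct\n", count_distinct_equality_only(sample, 5, track));
    return 0;
}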
Best Regards,
Simon Riggs

Attachment