Re: [PERFORM] Bad n_distinct estimation; hacks suggested? - Mailing list pgsql-hackers
| From | Marko Ristola |
|---|---|
| Subject | Re: [PERFORM] Bad n_distinct estimation; hacks suggested? |
| Date | |
| Msg-id | 42712105.6030709@kolumbus.fi |
| In response to | Re: [PERFORM] Bad n_distinct estimation; hacks suggested? (Greg Stark <gsstark@mit.edu>) |
| List | pgsql-hackers |
First I will comment on my original idea. Second, I will give another, improved suggestion (an idea). I hope they will be useful for you. (I don't know whether the first one was useful at all, because it showed that I, and some others of us, are not very good with statistics :( )

I haven't looked at the PostgreSQL code, so I don't know what is possible now and what is not. I do know, even without looking at the code, that a full table scan followed by incremental statistics updates would be a very big change.

What I meant with my original idea was the following:
- Compare two equal-sized samples. Then redo the same thing with samples of double the size. So, do lots of unnecessary work. Check the correlation of the two samples to try to guess the distribution.

So I tried to give you an idea, not a full answer to the whole problem.

I did read some parts of the attached PDFs. They convinced me that the heuristics for the hard cases would actually read almost the whole table in many cases. I covered the "too little sample" problem by stating that the user should be able to give a minimum sample size; this way you would avoid the too-small-sample problem. My purpose was not to achieve estimates that are at most 5% wrong, but to decrease the 2000% wrong estimates that are sometimes seen now.

Conclusions:
- No heuristics or similar tricks on small samples will give excellent results.
- If you need excellent estimates, you need to process the whole table!
- Some special cases, like primary keys, unique indexes and special column types, do give easy ways to make estimates. For example, whether a boolean column has zero, one or two distinct values does not matter so much(?). Hashing seems the right choice for all of them.

If I have understood correctly, full table scans are out of the question for large tables at this time. The idea of taking a 10% sample seems good.

So here is another suggestion:

1. Do the whole percentage scan as one continuous piece, starting at an arbitrary position. If the user's data is not homogeneous, this hurts accuracy, but this way it is faster. During that scan, try to figure out all the columns that have at most 100 distinct values. Of course, you can't reach 100% accuracy with it, but if a full table scan is out of the question now, it is better if the accuracy is, for example, at most ten times wrong.

You could also improve accuracy by doing 20 pieces of 0.5% partial table scans instead of one 10% partial table scan: this would improve accuracy a bit, but keep the speed almost the same as a single partial table scan.

Here are questions for the statisticians about the distinct values calculation:
- If we want at most 1000% tolerance, how big a percentage of a table's column must be processed?
- If we want at most 500% tolerance, how big a percentage of a table's column must be processed?
- If we want at most 250% tolerance, how big a percentage of a table's column must be processed?
You may assume that there are at most 100 distinct values in the table, if that helps the calculations.

If we try to get as much as possible out of one discontinuous partial table scan (a 0.1-10% sample), here is the information we can gather:
1. We could gather a histogram of at most 100 distinct values for every column.
2. We could measure the variance, the average and the number of rows for these 100 distinct values.
3. We could count the number of rows that didn't match these 100 distinct values: they were left out of the histogram.
4. We could get a minimum and a maximum value for each column.
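To make that single pass concrete, here is a rough sketch in plain Python (only for illustration; the names, the row format and the fixed limit of 100 tracked values are my own assumptions, this is not PostgreSQL code):

```python
from collections import defaultdict

LIMIT = 100  # stop admitting new values once a column exceeds this many distinct values


def collect_column_stats(sample_rows, columns):
    """One pass over sampled rows; per column keep min/max, a bounded
    value -> count map (the histogram) and a count of rows whose value
    no longer fits into the tracked set."""
    stats = {col: {"counts": defaultdict(int), "overflow": 0,
                   "min": None, "max": None} for col in columns}
    for row in sample_rows:          # row is assumed to be a dict keyed by column name
        for col in columns:
            v = row[col]
            s = stats[col]
            if s["min"] is None or v < s["min"]:
                s["min"] = v
            if s["max"] is None or v > s["max"]:
                s["max"] = v
            if v in s["counts"] or len(s["counts"]) < LIMIT:
                s["counts"][v] += 1   # value belongs to the tracked histogram
            else:
                s["overflow"] += 1    # column has more than LIMIT distinct values
    return stats
```

If `overflow` stays at zero, the distinct count of the sample is exact for that column; if it is positive, you at least know the column has more than 100 distinct values in the sample, plus its minimum and maximum.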
=> We could get exact information about the sample, for many columns, with one 0.1-10% pass. What can you statisticians gather from these values?

My idea is programmatic, combined with statistics:
+ Performance: scan, for example, 100 blocks, each of size 100 MB, because disk I/O is much faster this way.
+ Enables a larger table percentage. I hope it helps with the statistics formula. For more robust statistics, take those blocks at random (non-overlapping) positions, to reduce the effect of hitting statistically bad parts of the table.
+ Fewer table scan passes: scan all columns with limited hashing in the first pass.
+ All easy columns are found here with one pass.
+- Harder columns each need a pass of their own, but after all we have some preliminary knowledge of them from the given sample (minimum and maximum values and the histogram of the 100 distinct values).

Marko Ristola

Greg Stark wrote:

> "Dave Held" <dave.held@arraysg.com> writes:
>
>>> Actually, it's more to characterize how large of a sample
>>> we need. For example, if we sample 0.005 of disk pages, and
>>> get an estimate, and then sample another 0.005 of disk pages
>>> and get an estimate which is not even close to the first
>>> estimate, then we have an idea that this is a table which
>>> defies analysis based on small samples.
>>
>> I buy that.
>
> Better yet is to use the entire sample you've gathered of .01 and then perform
> analysis on that sample to see what the confidence interval is. Which is
> effectively the same as what you're proposing except looking at every possible
> partition.
>
> Unfortunately the reality according to the papers that were sent earlier is
> that you will always find the results disappointing. Until your sample is
> nearly the entire table your estimates for n_distinct will be extremely
> unreliable.
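As an illustration of the check discussed in the quote above (split the sample, estimate n_distinct on each half, and flag the table when the two estimates disagree badly), here is a rough Python sketch; the scale-up estimator and the ratio threshold are placeholders of my own, not the estimator PostgreSQL actually uses:

```python
import random


def estimate_ndistinct(sample_values, total_rows):
    # Placeholder estimator: naive linear scale-up of the distinct count
    # seen in the sample. Real estimators (the ones in the papers cited
    # in this thread) behave differently, especially on skewed data.
    n = len(sample_values)
    if n == 0:
        return 0
    d = len(set(sample_values))
    return min(total_rows, d * total_rows / n)


def defies_small_samples(sample_values, total_rows, ratio_limit=4.0):
    # Split the sample into two random halves, estimate n_distinct on each,
    # and report whether the two estimates disagree by more than ratio_limit.
    shuffled = list(sample_values)
    random.shuffle(shuffled)
    half = len(shuffled) // 2
    e1 = estimate_ndistinct(shuffled[:half], total_rows)
    e2 = estimate_ndistinct(shuffled[half:], total_rows)
    if min(e1, e2) == 0:
        return True
    return max(e1, e2) / min(e1, e2) > ratio_limit
```

In practice such a check would be applied per column on the rows ANALYZE has already sampled; this sketch only shows the shape of the comparison, not how it would be wired into the planner statistics.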