Re: tweaking NTUP_PER_BUCKET - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: tweaking NTUP_PER_BUCKET
Msg-id: 53CAED68.4090104@fuzzy.cz
In response to: Re: tweaking NTUP_PER_BUCKET (Tomas Vondra <tv@fuzzy.cz>)
Responses: Re: tweaking NTUP_PER_BUCKET
List: pgsql-hackers
On 19.7.2014 23:07, Tomas Vondra wrote:
> On 19.7.2014 20:28, Tomas Vondra wrote:
>
> For the first case, a WARNING at the end of estimate_hash_bucketsize
> says this:
>
>   WARNING: nbuckets=8388608.00 estfract=0.000001
>   WARNING: nbuckets=65536.00 estfract=0.000267
>
> There are 4.3M rows in the big_table, so it says ~4 tuples/bucket
> (selectivity = 1e-6), and ~10 tuples/bucket for the small_table
> (42k rows).
>
> For the second case, I get this:
>
>   WARNING: nbuckets=16777216.00 estfract=0.000001
>   WARNING: nbuckets=262144.00 estfract=0.000100
>
> i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.
>
> That sounds really strange, because small_table in the second test
> case is almost perfectly unique. And in the first test case, there
> are only very few keys with >2 occurrences.

FWIW, the "problem" seems to be this:

    /*
     * Adjust estimated bucketsize upward to account for skewed
     * distribution.
     */
    if (avgfreq > 0.0 && mcvfreq > avgfreq)
        estfract *= mcvfreq / avgfreq;

which the comment in estimate_hash_bucketsize explains like this:

     * be nonuniformly occupied. If the other relation in the join has a key
     * distribution similar to this one's, then the most-loaded buckets are
     * exactly those that will be probed most often. Therefore, the "average"
     * bucket size for costing purposes should really be taken as something
     * close to the "worst case" bucket size. We try to estimate this by
     * adjusting the fraction if there are too few distinct data values, and
     * then scaling up by the ratio of the most common value's frequency to
     * the average frequency.

The problem is that the two test cases don't behave like this, i.e. the
other relation does not have a similar distribution (because there the
join key is unique). Actually, I wonder how frequently that happens,
and I wouldn't be surprised if it was quite rare.

So maybe we shouldn't scale it the way we scale it now. For example, we
could do this:

    if (avgfreq > 0.0 && mcvfreq > avgfreq)
        estfract *= sqrt(mcvfreq / avgfreq);

Or, instead of using the first MCV frequency (i.e. the frequency of the
most common value, which causes the unexpectedly high tuples/bucket
values), use an average or median of the MCV frequencies.

Each of these three changes "fixed" the first test case, and some of
them fixed the second one too. However, it's pure guesswork, and I have
no idea how many plans would be hit negatively by these changes.

What I think might work better is lowering the cost of searching small
hash tables, when the hash table fits into the L1/L2/... caches. The
table I posted in the first message in this thread illustrates that:

    kB       NTUP=1   NTUP=2   NTUP=4   NTUP=5   NTUP=9   NTUP=10
    1407     6753     7218     7890     9324     9488     12574
    7032     9417     11586    17417    15820    25732    25402
    35157    13179    17101    27225    24763    43323    43175
    70313    14203    18475    29354    26690    46840    46676
    175782   14319    17458    25325    34542    37342    60949
    351563   16297    19928    29124    43364    43232    70014
    703125   19027    23125    33610    50283    50165    81398

So a small hash table (~1.4MB) that fits into L2 (measured on a CPU
with ~4MB cache) is influenced much less than the large tables. I don't
know whether we should detect the CPU cache size somehow, or whether we
should assume some conservative size (say, 4MB sounds OK), or what.
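For what it's worth, a minimal sketch of detecting the L2 cache size at
run time could look like the snippet below. It assumes glibc, where
_SC_LEVEL2_CACHE_SIZE is available as a sysconf() extension; on other
platforms (or when sysconf() reports nothing useful) it just falls back
to the conservative 4MB mentioned above:

    #include <unistd.h>

    /*
     * Best-effort guess of the L2 cache size in bytes (sketch only).
     * _SC_LEVEL2_CACHE_SIZE is a glibc extension, so when it is missing
     * or sysconf() returns 0/-1, fall back to an assumed 4MB default.
     */
    static long
    guess_l2_cache_size(void)
    {
    #ifdef _SC_LEVEL2_CACHE_SIZE
        long        sz = sysconf(_SC_LEVEL2_CACHE_SIZE);

        if (sz > 0)
            return sz;
    #endif
        return 4L * 1024L * 1024L;      /* assumed conservative default */
    }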
But I think something like this might work:

    if (hash_size < 4MB)
    {
        /* take in only 10% of the difference */
        estfract += (mcvfreq / avgfreq) * estfract * 0.1;
    }
    else if (hash_size > 16MB)
    {
        estfract *= (mcvfreq / avgfreq);
    }
    else
    {
        /* linear approximation between 10 and 100% */
        estfract += (mcvfreq / avgfreq) * estfract
                    * (0.1 + 0.9 * (hash_size - 4MB) / 12MB);
    }

Or maybe not, I'm not really sure. The problem is that the two test
cases really don't match the assumption that the outer relation has the
same distribution.

regards
Tomas
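To make that proposal concrete, here is the same logic written out as a
compilable helper. estfract, mcvfreq and avgfreq correspond to the
local variables in estimate_hash_bucketsize quoted earlier, while
hash_size_bytes and the 4MB/16MB thresholds are assumptions (the
expected hash table size is not currently passed to that code):

    #define HASH_CACHE_FIT_BYTES   (4 * 1024 * 1024)     /* assumed ~L2 size */
    #define HASH_CACHE_MISS_BYTES  (16 * 1024 * 1024)    /* assumed "too big" */

    /*
     * Literal translation of the pseudocode above: apply only 10% of the
     * skew correction when the table should fit into the CPU cache, the
     * full correction when it clearly doesn't, and interpolate linearly
     * in between.
     */
    static double
    adjust_estfract_for_cache(double estfract, double mcvfreq,
                              double avgfreq, double hash_size_bytes)
    {
        if (!(avgfreq > 0.0 && mcvfreq > avgfreq))
            return estfract;    /* same guard as the existing code */

        if (hash_size_bytes < HASH_CACHE_FIT_BYTES)
        {
            /* take in only 10% of the difference */
            estfract += (mcvfreq / avgfreq) * estfract * 0.1;
        }
        else if (hash_size_bytes > HASH_CACHE_MISS_BYTES)
        {
            estfract *= (mcvfreq / avgfreq);
        }
        else
        {
            /* linear approximation between 10 and 100% */
            estfract += (mcvfreq / avgfreq) * estfract *
                (0.1 + 0.9 * (hash_size_bytes - HASH_CACHE_FIT_BYTES) /
                 (HASH_CACHE_MISS_BYTES - HASH_CACHE_FIT_BYTES));
        }

        return estfract;
    }

With those thresholds, a 10MB hash table would apply
0.1 + 0.9 * (10 - 4) / 12 = 0.55, i.e. a bit more than half of the full
mcvfreq/avgfreq correction.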