From: Tomas Vondra
Subject: Re: tweaking NTUP_PER_BUCKET
Msg-id: 53CAED68.4090104@fuzzy.cz
In response to: Re: tweaking NTUP_PER_BUCKET (Tomas Vondra <tv@fuzzy.cz>)
Responses: Re: tweaking NTUP_PER_BUCKET
List: pgsql-hackers
On 19.7.2014 23:07, Tomas Vondra wrote:
> On 19.7.2014 20:28, Tomas Vondra wrote:
> For the first case, a WARNING at the end of estimate_hash_bucketsize
> says this:
> 
> WARNING:  nbuckets=8388608.00 estfract=0.000001
> WARNING:  nbuckets=65536.00 estfract=0.000267
> 
> There are 4.3M rows in the big_table, so it says ~4 tuples (selectivity
> = 1e-6), and ~10 tuples/bucket for the small_table (42k rows).
> 
> For the second case, I get this:
> 
> WARNING:  nbuckets=16777216.00 estfract=0.000001
> WARNING:  nbuckets=262144.00 estfract=0.000100
> 
> i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.
> 
> That sounds really strange, because small_table in the second test case
> is almost perfectly unique. And in the first test case, there are only
> very few keys with >2 occurrences.

FWIW, the "problem" seems to be this:
    /*
     * Adjust estimated bucketsize upward to account for skewed
     * distribution.
     */
    if (avgfreq > 0.0 && mcvfreq > avgfreq)
        estfract *= mcvfreq / avgfreq;
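
To illustrate how aggressive that scaling is for a nearly-unique inner
relation, here's a back-of-the-envelope calculation (the row count is
small_table from the first test case, the MCV count of 10 is just an
illustrative guess):

    avgfreq = (1 - nullfrac) / ndistinct ~= 1 / 42000 ~= 2.4e-5
    mcvfreq = 10 / 42000                              ~= 2.4e-4
    estfract *= mcvfreq / avgfreq                      = 10x

So a single key occurring 10 times inflates the estimated bucket size of
the whole table by an order of magnitude, even if all the other keys are
unique.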

This is explained in the estimate_hash_bucketsize comment like this:

* be nonuniformly occupied. If the other relation in the join has a key
* distribution similar to this one's, then the most-loaded buckets are
* exactly those that will be probed most often. Therefore, the "average"
* bucket size for costing purposes should really be taken as something
* close to the "worst case" bucket size. We try to estimate this by
* adjusting the fraction if there are too few distinct data values, and
* then scaling up by the ratio of the most common value's frequency to
* the average frequency.

The problem is that the two test cases don't behave like this, i.e. the
other relation does not have a similar distribution (the join key there
is unique). Actually, I wonder how frequently that assumption holds in
practice, and I wouldn't be surprised if it were quite rare.

So maybe we shouldn't scale it the way we scale it now. For example we
could do this:
    if (avgfreq > 0.0 && mcvfreq > avgfreq)
        estfract *= sqrt(mcvfreq / avgfreq);
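
Just to quantify the damping (my numbers, purely illustrative): with
mcvfreq/avgfreq = 16 the current code inflates estfract 16x, while the
sqrt variant inflates it only 4x; for a ratio of 100 it's 10x vs. 100x.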

Or instead of using the first MCV frequency (i.e. the frequency of the
most common value, which causes the unexpectedly high tuple/bucket
values), use an average or median of the MCV frequencies.
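
A minimal sketch of the median variant, assuming we reuse the numbers /
nnumbers MCV frequency array that estimate_hash_bucketsize already gets
from get_attstatsslot (the helper name is mine):

    /*
     * Hypothetical helper: median of the MCV frequencies.  pg_statistic
     * stores the MCV frequencies sorted in descending order, so the
     * median is simply the middle element (or the average of the two
     * middle elements for an even count).
     */
    static double
    mcv_median_freq(const float4 *numbers, int nnumbers)
    {
        if (nnumbers == 0)
            return 0.0;
        if (nnumbers % 2 == 1)
            return numbers[nnumbers / 2];
        return (numbers[nnumbers / 2 - 1] + numbers[nnumbers / 2]) / 2.0;
    }

and then use mcv_median_freq(numbers, nnumbers) where we currently take
numbers[0].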

Any of these three changes "fixed" the first test case, and some also
fixed the second one. However, it's pure guesswork, and I have no idea
how many plans would be hit negatively by these changes.

What I think might work better is lowering the cost of searching small
hash tables, when the hash table fits into L1/L2... caches. The table I
posted in the first message in this thread illustrates that:
    kB       NTUP=1  NTUP=2  NTUP=4  NTUP=5  NTUP=9  NTUP=10
    1407       6753    7218    7890    9324    9488    12574
    7032       9417   11586   17417   15820   25732    25402
    35157     13179   17101   27225   24763   43323    43175
    70313     14203   18475   29354   26690   46840    46676
    175782    14319   17458   25325   34542   37342    60949
    351563    16297   19928   29124   43364   43232    70014
    703125    19027   23125   33610   50283   50165    81398
 

So a small hash table (~1.4MB), fitting into the L2 cache (measured on a
CPU with ~4MB cache), is influenced much less than the large tables. I
don't know whether we should detect the CPU cache size somehow, or assume
some conservative size (say, 4MB sounds OK), or what. But I think
something like this might work:
    if (hash_size < 4MB)
    {
        /* take in only 10% of the difference */
        estfract += (mcvfreq / avgfreq) * estfract * 0.1;
    }
    else if (hash_size > 16MB)
    {
        estfract *= (mcvfreq / avgfreq);
    }
    else
    {
        /* linear approximation between 10 and 100% */
        estfract += (mcvfreq / avgfreq) * estfract
                    * (0.1 + 0.9 * (hash_size - 4MB) / 12MB);
    }
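
Just to make the idea concrete, here's a compilable sketch of that logic
(standalone, with the PG context stripped; the 4MB/16MB thresholds are
the guesses from the pseudocode above, and the hash_size_bytes parameter
is my assumption about what we'd pass in, presumably derived from the
estimated hashtable size):

    /* Cache-size thresholds: pure guesses, see above. */
    #define CACHE_FRIENDLY_SIZE   (4.0 * 1024 * 1024)   /* ~L2 size */
    #define CACHE_UNFRIENDLY_SIZE (16.0 * 1024 * 1024)

    static double
    adjust_estfract(double estfract, double mcvfreq, double avgfreq,
                    double hash_size_bytes)
    {
        double  ratio;

        if (avgfreq <= 0.0 || mcvfreq <= avgfreq)
            return estfract;        /* no skew adjustment needed */

        ratio = mcvfreq / avgfreq;

        if (hash_size_bytes < CACHE_FRIENDLY_SIZE)
        {
            /* cache-resident table: take in only 10% of the difference */
            estfract += ratio * estfract * 0.1;
        }
        else if (hash_size_bytes > CACHE_UNFRIENDLY_SIZE)
        {
            /* large table: keep the current full scaling */
            estfract *= ratio;
        }
        else
        {
            /* linear interpolation between 10% and 100% */
            double  frac = (hash_size_bytes - CACHE_FRIENDLY_SIZE)
                         / (CACHE_UNFRIENDLY_SIZE - CACHE_FRIENDLY_SIZE);

            estfract += ratio * estfract * (0.1 + 0.9 * frac);
        }

        return estfract;
    }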
 

Or maybe not, I'm not really sure. The problem is that the two test
cases really don't match the assumption that the outer relation has the
same distribution.

regards
Tomas


