Re: tweaking NTUP_PER_BUCKET - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: tweaking NTUP_PER_BUCKET
Date
Msg-id 53CBBC95.5000702@fuzzy.cz
In response to Re: tweaking NTUP_PER_BUCKET  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
On 20.7.2014 00:12, Tomas Vondra wrote:
> On 19.7.2014 23:07, Tomas Vondra wrote:
>> On 19.7.2014 20:28, Tomas Vondra wrote:
>> For the first case, a WARNING at the end of estimate_hash_bucketsize
>> says this:
>>
>> WARNING:  nbuckets=8388608.00 estfract=0.000001
>> WARNING:  nbuckets=65536.00 estfract=0.000267
>>
>> There are 4.3M rows in the big_table, so it says ~4 tuples (selectivity
>> >= 1e-6), and ~10 tuples/bucket for the small_table (42k rows).
>>
>> For the second case, I get this:
>>
>> WARNING:  nbuckets=16777216.00 estfract=0.000001
>> WARNING:  nbuckets=262144.00 estfract=0.000100
>>
>> i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.
>>
>> That sounds really strange, because small_table in the second test case
>> is almost perfectly unique. And in the first test case, there are only
>> very few keys with >2 occurrences.
> 
> FWIW, the "problem" seems to be this:
> 
>   /*
>    * Adjust estimated bucketsize upward to account for skewed
>    * distribution.
>    */
> 
>   if (avgfreq > 0.0 && mcvfreq > avgfreq)
>     estfract *= mcvfreq / avgfreq;
> 
> Which is explained in the estimate_hash_bucketsize comment like this:
> 
> * be nonuniformly occupied. If the other relation in the join has a key
> * distribution similar to this one's, then the most-loaded buckets are
> * exactly those that will be probed most often. Therefore, the "average"
> * bucket size for costing purposes should really be taken as something
> * close to the "worst case" bucket size. We try to estimate this by
> * adjusting the fraction if there are too few distinct data values, and
> * then scaling up by the ratio of the most common value's frequency to
> * the average frequency.
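
For illustration, here's a toy paraphrase of the shape of that computation
(my own simplified sketch, not the actual estimate_hash_bucketsize code),
plus the arithmetic behind the tuples-per-bucket numbers above; the skew
values passed in at the end are made up:

#include <stdio.h>

/*
 * Toy paraphrase of the logic described in the quoted comment -- NOT the
 * real estimate_hash_bucketsize(): start from the "uniform" fraction
 * 1/nbuckets (or 1/ndistinct when there are fewer distinct values than
 * buckets), then scale up by mcvfreq/avgfreq to lean towards the
 * worst-case bucket.
 */
static double
toy_bucketsize(double nbuckets, double ndistinct,
               double mcvfreq, double avgfreq)
{
    double estfract;

    if (ndistinct < nbuckets)
        estfract = 1.0 / ndistinct;
    else
        estfract = 1.0 / nbuckets;

    /* the adjustment being questioned */
    if (avgfreq > 0.0 && mcvfreq > avgfreq)
        estfract *= mcvfreq / avgfreq;

    return estfract;
}

int
main(void)
{
    /*
     * The reported estfract values translate into tuples per bucket by
     * multiplying with the relation size:
     *
     *   big_table:    0.000001 * 4,300,000 ~  4.3 tuples/bucket
     *   small_table:  0.000267 *    42,000 ~ 11.2 tuples/bucket
     */
    printf("big_table:   ~%.1f tuples/bucket\n", 0.000001 * 4300000);
    printf("small_table: ~%.1f tuples/bucket\n", 0.000267 * 42000);

    /* made-up skew: MCV 4x more frequent than the average value */
    printf("toy estfract: %g\n",
           toy_bucketsize(65536.0, 40000.0, 0.0001, 0.000025));
    return 0;
}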

After thinking about this a bit more, I'm starting to question the logic
behind the adjustment. The thing is, with NTUP_PER_BUCKET=1, all the
tuples in a bucket should have the same hash value and (hash collisions
aside) the same join key value. That means we're not really searching
through the bucket; we're merely iterating over it and returning all the
tuples.
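
To illustrate what I mean, a minimal standalone sketch of probing a single
bucket chain (not the executor code, just the shape of the loop):

#include <stddef.h>

/*
 * Minimal sketch of probing one hash bucket -- not the executor code.
 * Each entry in the chain carries its hash value; the probe walks the
 * chain and returns the entries whose hash matches.  With
 * NTUP_PER_BUCKET=1 the whole chain is expected to share a single hash
 * value, so this is not a "search" to be penalized -- it simply returns
 * the tuples that will be joined anyway.
 */
typedef struct HashEntry
{
    struct HashEntry *next;     /* next tuple in the same bucket */
    unsigned int hashvalue;     /* hash of the join key */
    /* ... tuple data would follow here ... */
} HashEntry;

/* return the next matching entry after 'prev' (or from the head if NULL) */
static HashEntry *
probe_bucket(HashEntry *bucket_head, unsigned int probe_hash,
             HashEntry *prev)
{
    HashEntry *e = prev ? prev->next : bucket_head;

    for (; e != NULL; e = e->next)
    {
        if (e->hashvalue == probe_hash)
            return e;           /* caller still re-checks the join quals */
    }
    return NULL;
}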

With NTUP_PER_BUCKET=10 it may have made sense, because multiple hash
values were expected in a single bucket. With NTUP_PER_BUCKET=1 we
shouldn't really expect that. So I think we should rip the MCV
adjustment out of estimate_hash_bucketsize (and indeed, doing so fixed
both test cases).

The only case where I think this might play a role is when we can't use
the optimal number of buckets (achieving NTUP_PER_BUCKET=1), but in that
case the patched ExecChooseHashTableSize prefers to increase the number
of batches. So that seems safe to me.

Or in what scenario (with NTUP_PER_BUCKET=1) does this still make sense?
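
To make the buckets-vs-batches trade-off concrete, here's a toy sketch of
the idea (my own illustration, not the patched ExecChooseHashTableSize):
keep nbuckets sized for NTUP_PER_BUCKET=1, and when the bucket array plus
the tuples no longer fit into work_mem, double the number of batches
instead of letting the buckets get overloaded:

#include <stddef.h>

#define NTUP_PER_BUCKET 1

/* round up to the next power of two */
static size_t
next_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Toy sketch of the buckets-vs-batches trade-off -- not the actual
 * ExecChooseHashTableSize().  nbuckets stays sized for NTUP_PER_BUCKET=1;
 * if the per-batch memory estimate exceeds work_mem, nbatch is doubled.
 */
static void
toy_choose_size(double ntuples, size_t tuple_bytes, size_t work_mem_bytes,
                size_t *nbuckets, int *nbatch)
{
    *nbatch = 1;
    for (;;)
    {
        double tuples_per_batch = ntuples / *nbatch;
        size_t buckets = next_pow2((size_t) (tuples_per_batch / NTUP_PER_BUCKET) + 1);
        size_t space = buckets * sizeof(void *)
                     + (size_t) tuples_per_batch * tuple_bytes;

        if (space <= work_mem_bytes)
        {
            *nbuckets = buckets;
            return;
        }
        *nbatch *= 2;           /* prefer more batches over degraded buckets */
    }
}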


Two more comments:

(a) Isn't the minimum selectivity enforced in estimate_hash_bucketsize
    (1.0e-6) too high? I mean, with 1GB work_mem I can get ~20M tuples
    into the hash table, and in that case it's off by an order of
    magnitude. Maybe 1e-9 would be more appropriate?
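
The arithmetic behind (a), for illustration (the 1.0e-6 value is the
floor in question, the tuple count is just my example):

#include <stdio.h>

int
main(void)
{
    /*
     * With ~20M tuples in the hash table the "uniform" bucket fraction is
     * about 1/20e6 = 5e-8, but a 1.0e-6 floor bumps it up to 1e-6 -- a
     * ~20x overestimate of the expected bucket size.
     */
    double ntuples = 20e6;
    double uniform = 1.0 / ntuples;             /* ~5e-8 */
    double clamped = (uniform < 1.0e-6) ? 1.0e-6 : uniform;

    printf("uniform: %.2e  clamped: %.2e  (off by %.0fx)\n",
           uniform, clamped, clamped / uniform);
    return 0;
}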
 

(b) Currently, ExecChooseHashTableSize sizes the hash table (nbuckets)
    using ntuples, but maybe ndistinct (if available) would be better?
    It's tricky, though, because while both values are initially
    estimates, ndistinct is notoriously less reliable. Also, we
    currently don't have means to count ndistinct while building the
    table (as opposed to ntuples, which is trivial), which is needed
    when resizing the hash table in case the initial estimate was off.
    So currently the code optimizes for "ndistinct==ntuples", which may
    produce larger nbuckets values, but ISTM it's the right choice.
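
A toy illustration of (b), with made-up numbers (not the actual sizing
code): sizing nbuckets from ntuples vs. from ndistinct can differ by a
large factor when the key is far from unique:

#include <stddef.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 1

/* round up to the next power of two */
static size_t
next_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

int
main(void)
{
    double ntuples = 4300000;       /* hypothetical row count */
    double ndistinct = 100000;      /* hypothetical distinct join keys */

    size_t by_ntuples = next_pow2((size_t) (ntuples / NTUP_PER_BUCKET));
    size_t by_ndistinct = next_pow2((size_t) (ndistinct / NTUP_PER_BUCKET));

    printf("nbuckets from ntuples:   %zu\n", by_ntuples);   /* 8388608 */
    printf("nbuckets from ndistinct: %zu\n", by_ndistinct); /* 131072 */
    return 0;
}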
 

regards
Tomas




