Thread: Comment fix and question about dshash.c

Comment fix and question about dshash.c

From: Antonin Houska
1. The return type of the resize() function is void, so I propose that part of
the header comment be removed:

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b46f7c4cfd..b2b8fe60e1 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -672,9 +672,7 @@ delete_item(dshash_table *hash_table, dshash_table_item *item)

 /*
  * Grow the hash table if necessary to the requested number of buckets.  The
- * requested size must be double some previously observed size.  Returns true
- * if the table was successfully expanded or found to be big enough already
- * (because another backend expanded it).
+ * requested size must be double some previously observed size.
  *
  * Must be called without any partition lock held.
  */


2. Can anyone please explain this macro?

/* Max entries before we need to grow.  Half + quarter = 75% load factor. */
#define MAX_COUNT_PER_PARTITION(hash_table)                \
    (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \
     BUCKETS_PER_PARTITION(hash_table->size_log2) / 4)

I'm failing to understand why the maximum number of hash table entries in a
partition should be smaller than the number of buckets in that partition.

The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is
obvious from this condition in dshash_find_or_insert()

    /* Check if we are getting too full. */
    if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: Comment fix and question about dshash.c

From: Thomas Munro
On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote:
> 1. The return type of the resize() function is void, so I propose that part
> of the header comment be removed:
>
> diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
> index b46f7c4cfd..b2b8fe60e1 100644
> --- a/src/backend/lib/dshash.c
> +++ b/src/backend/lib/dshash.c
> @@ -672,9 +672,7 @@ delete_item(dshash_table *hash_table, dshash_table_item *item)
>
>  /*
>   * Grow the hash table if necessary to the requested number of buckets.  The
> - * requested size must be double some previously observed size.  Returns true
> - * if the table was successfully expanded or found to be big enough already
> - * (because another backend expanded it).
> + * requested size must be double some previously observed size.
>   *
>   * Must be called without any partition lock held.
>   */

Thanks, will fix.

> 2. Can anyone please explain this macro?
>
> /* Max entries before we need to grow.  Half + quarter = 75% load factor. */
> #define MAX_COUNT_PER_PARTITION(hash_table)                             \
>         (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \
>          BUCKETS_PER_PARTITION(hash_table->size_log2) / 4)
>
> I'm failing to understand why the maximum number of hash table entries in a
> partition should be smaller than the number of buckets in that partition.
>
> The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is
> obvious from this condition in dshash_find_or_insert()
>
>         /* Check if we are getting too full. */
>         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))

Are you saying there is a bug in this logic (which is nbuckets * 0.75
expressed as integer maths), or saying that 0.75 is not a good maximum
load factor?  I looked around at a couple of general purpose hash
tables and saw that some used 0.75 and some used 1.0, as a wasted
space-vs-collision trade-off.  If I have my maths right[1], with 0.75
you expect to have 75 entries in ~53 buckets, but with 1.0 you expect
to have 100 entries in ~64 buckets.  It'd be a fair criticism that
that's arbitrarily different to the choice made for hash joins, and
dynahash's default is 1 (though it's a run-time parameter).

[1] https://math.stackexchange.com/questions/470662/average-number-of-bins-occupied-when-throwing-n-balls-into-n-bins
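
For what it's worth, that estimate can be checked with a few lines of C (a
minimal sketch, assuming uniformly distributed hashes; the 100 buckets are
purely for illustration, the real number is a power of 2):

/*
 * Stand-alone sketch: with uniform hashing, n keys thrown into nbuckets
 * buckets are expected to occupy nbuckets * (1 - (1 - 1/nbuckets)^n) of
 * them.  For 100 buckets this prints roughly 53 at load 0.75 and 63 at
 * load 1.0.
 */
#include <math.h>
#include <stdio.h>

static double
expected_occupied(double nbuckets, double nentries)
{
    return nbuckets * (1.0 - pow(1.0 - 1.0 / nbuckets, nentries));
}

int
main(void)
{
    printf("load 0.75: ~%.1f occupied buckets\n", expected_occupied(100, 75));
    printf("load 1.00: ~%.1f occupied buckets\n", expected_occupied(100, 100));
    return 0;
}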

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Comment fix and question about dshash.c

From: Thomas Munro
On Sat, Oct 27, 2018 at 10:03 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> space-vs-collision trade-off.  If I have my maths right[1], with 0.75
> you expect to have 75 entries in ~53 buckets, but with 1.0 you expect
> to have 100 entries in ~64 buckets.  It'd be a fair criticism that

Forgot to add: assuming 100 buckets, for illustration (the real number
is a power of 2).

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Comment fix and question about dshash.c

From: Antonin Houska
Thomas Munro <thomas.munro@enterprisedb.com> wrote:

> On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote:
>
> Thanks, will fix.
>
> > 2. Can anyone please explain this macro?
> >
> > /* Max entries before we need to grow.  Half + quarter = 75% load factor. */
> > #define MAX_COUNT_PER_PARTITION(hash_table)                             \
> >         (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \
> >          BUCKETS_PER_PARTITION(hash_table->size_log2) / 4)
> >
> > I'm failing to understand why the maximum number of hash table entries in a
> > partition should be smaller than the number of buckets in that partition.
> >
> > The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is
> > obvious from this condition in dshash_find_or_insert()
> >
> >         /* Check if we are getting too full. */
> >         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
>
> Are you saying there is a bug in this logic (which is nbuckets * 0.75
> expressed as integer maths), or saying that 0.75 is not a good maximum
> load factor?  I looked around at a couple of general purpose hash
> tables and saw that some used 0.75 and some used 1.0, as a wasted
> space-vs-collision trade-off.  If I have my maths right[1], with 0.75
> you expect to have 75 entries in ~53 buckets, but with 1.0 you expect
> to have 100 entries in ~64 buckets.

I don't know exactly how you apply the [1] formula (what is "n" and what is
"N" in your case?), but my consideration was much simpler: For example, if
BUCKETS_PER_PARTITION returns 8 (a power of 2 is expected here and is also
more convenient), then MAX_COUNT_PER_PARTITION returns 8 / 2 + 8 / 4 = 6.
Thus the hash table gets resized if we're going to add the 7th entry to the
partition, i.e. the number of entries in the partition is lower than the
number of buckets. Is that o.k.?

> [1] https://math.stackexchange.com/questions/470662/average-number-of-bins-occupied-when-throwing-n-balls-into-n-bins
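
Here is that arithmetic as a tiny stand-alone program (only an illustration;
the value 8 is an assumed partition size and the real definitions live in
dshash.c):

#include <stdio.h>

/* assumed example values, mirroring the integer maths of the dshash macros */
#define EXAMPLE_BUCKETS_PER_PARTITION 8
#define EXAMPLE_MAX_COUNT \
    (EXAMPLE_BUCKETS_PER_PARTITION / 2 + EXAMPLE_BUCKETS_PER_PARTITION / 4)

int
main(void)
{
    int     count;

    /* 8/2 + 8/4 = 6, so the 7th entry in a partition triggers a resize */
    for (count = 1; count <= EXAMPLE_BUCKETS_PER_PARTITION; count++)
        printf("count = %d -> %s\n", count,
               count > EXAMPLE_MAX_COUNT ? "resize" : "no resize");
    return 0;
}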

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: Comment fix and question about dshash.c

From: Antonin Houska
Antonin Houska <ah@cybertec.at> wrote:

> Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> > On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote:
> >
> > Are you saying there is a bug in this logic (which is nbuckets * 0.75
> > expressed as integer maths), or saying that 0.75 is not a good maximum
> > load factor?  I looked around at a couple of general purpose hash
> > tables and saw that some used 0.75 and some used 1.0, as a wasted
> > space-vs-collision trade-off.  If I have my maths right[1], with 0.75
> > you expect to have 75 entries in ~53 buckets, but with 1.0 you expect
> > to have 100 entries in ~64 buckets.
>
> I don't know exactly how you apply the [1] formula (what is "n" and what is
> "N" in your case?), but my consideration was much simpler: For example, if
> BUCKETS_PER_PARTITION returns 8 (a power of 2 is expected here and is also
> more convenient), then MAX_COUNT_PER_PARTITION returns 8 / 2 + 8 / 4 = 6.
> Thus the hash table gets resized if we're going to add the 7th entry to the
> partition, i.e. the number of entries in the partition is lower than the
> number of buckets. Is that o.k.?

Well, it may be o.k. I've just checked what the fill factor means in a hash
index, and it's also the number of entries divided by the number of buckets.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: Comment fix and question about dshash.c

From: Tomas Vondra
On 10/27/2018 01:51 PM, Antonin Houska wrote:
> Antonin Houska <ah@cybertec.at> wrote:
> 
>> Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>
>>> On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote:
>>>
>>> Are you saying there is a bug in this logic (which is nbuckets * 0.75
>>> expressed as integer maths), or saying that 0.75 is not a good maximum
>>> load factor?  I looked around at a couple of general purpose hash
>>> tables and saw that some used 0.75 and some used 1.0, as a wasted
>>> space-vs-collision trade-off.  If I have my maths right[1], with 0.75
>>> you expect to have 75 entries in ~53 buckets, but with 1.0 you expect
>>> to have 100 entries in ~64 buckets.
>>
>> I don't know exactly how you apply the [1] formula (what is "n" and what is
>> "N" in your case?), but my consideration was much simpler: For example, if
>> BUCKETS_PER_PARTITION returns 8 (a power of 2 is expected here and is also
>> more convenient), then MAX_COUNT_PER_PARTITION returns 8 / 2 + 8 / 4 = 6.
>> Thus the hash table gets resized if we're going to add the 7th entry to the
>> partition, i.e. the number of entries in the partition is lower than the
>> number of buckets. Is that o.k.?
> 
> Well, it may be o.k. I've just checked what the fill factor means in a hash
> index, and it's also the number of entries divided by the number of buckets.
> 

Using a load factor of ~0.75 is definitely the right thing to do. One way
to interpret it is "average chain length" (or whatever the equivalent is
in that particular hash table implementation), and one of the main points
of hash tables is to eliminate linear scans. We could pick a value closer
to 1.0, but experience says it's not worth it - it's easy to get
annoyingly long chains due to hash collisions, in exchange for fairly
minimal space savings.
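
To put a rough number on "annoyingly long chains" (just a sketch, assuming
uniformly distributed hashes): the number of entries per bucket is
approximately Poisson-distributed with mean equal to the load factor, so the
expected share of buckets with chains of length k falls off quickly as the
load factor drops:

#include <math.h>
#include <stdio.h>

/* expected fraction of buckets holding exactly k entries under Poisson(load) */
static double
poisson_pmf(double load, int k)
{
    double  p = exp(-load);
    int     i;

    for (i = 1; i <= k; i++)
        p *= load / i;
    return p;
}

int
main(void)
{
    double  loads[] = {1.0, 0.75, 0.5};
    int     i, k;

    for (i = 0; i < 3; i++)
    {
        printf("load %.2f:", loads[i]);
        for (k = 1; k <= 5; k++)
            printf("  len %d: %.4f", k, poisson_pmf(loads[i], k));
        printf("\n");
    }
    return 0;
}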

That being said, I wonder if we should tweak NTUP_PER_BUCKET=1 in hash
joins to a lower value. IIRC we ended up with 1.0 because originally it
was set to 10.0, and we reduced it to 1.0 in 9.5, which gave us a nice
speedup. But I don't recall if we tried an even lower value (probably
not). Maybe we don't need to do that, because we only build the hash
table at the very end, when we know exactly how many entries it will
contain, so we don't need to do lookups and inserts at the same time,
and we don't need to grow the hash table (at least in the non-parallel
case). And we end up with a 0.75 load factor on average, due to the
doubling (the load factors are essentially uniformly distributed between
0.5+epsilon and 1.0-epsilon).
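
That last point is easy to simulate (again only a sketch; the 1024-2047 range
below is an arbitrary power-of-two interval, and the rounding mirrors what
NTUP_PER_BUCKET = 1 with power-of-two bucket counts implies):

#include <stdio.h>
#include <stdlib.h>

static unsigned int
next_power_of_two(unsigned int n)
{
    unsigned int result = 1;

    while (result < n)
        result *= 2;
    return result;
}

int
main(void)
{
    double  sum = 0.0;
    int     trials = 100000;
    int     i;

    srand(42);
    for (i = 0; i < trials; i++)
    {
        /* tuple counts spread uniformly within one power-of-two interval */
        unsigned int ntuples = 1024 + rand() % 1024;

        /* one bucket per tuple, rounded up to the next power of two */
        sum += (double) ntuples / next_power_of_two(ntuples);
    }
    printf("average load factor: %.3f\n", sum / trials);   /* prints ~0.75 */
    return 0;
}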

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Comment fix and question about dshash.c

From: Thomas Munro
On Sun, Oct 28, 2018 at 1:22 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 10/27/2018 01:51 PM, Antonin Houska wrote:
> > Antonin Houska <ah@cybertec.at> wrote:
> >> Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> >>> On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote:
> >>> Are you saying there is a bug in this logic (which is nbuckets * 0.75
> >>> expressed as integer maths), or saying that 0.75 is not a good maximum
> >>> load factor?  I looked around at a couple of general purpose hash
> >>> tables and saw that some used 0.75 and some used 1.0, as a wasted
> >>> space-vs-collision trade-off.  If I have my maths right[1], with 0.75
> >>> you expect to have 75 entries in ~53 buckets, but with 1.0 you expect
> >>> to have 100 entries in ~64 buckets.
> >>
> >> I don't know exactly how you apply the [1] formula (what is "n" and what is
> >> "N" in your case?), but my consideration was much simpler: For example, if
> >> BUCKETS_PER_PARTITION returns 8 (a power of 2 is expected here and is also
> >> more convenient), then MAX_COUNT_PER_PARTITION returns 8 / 2 + 8 / 4 = 6.
> >> Thus the hash table gets resized if we're going to add the 7th entry to the
> >> partition, i.e. the number of entries in the partition is lower than the
> >> number of buckets. Is that o.k.?

n balls = the keys being hashed and inserted
N bins = the hash table buckets

Unless we know of some special properties of the hash function and
input data, I believe we have to treat the hashes like uniformly
distributed random numbers when making predictions, hence the
balls-into-bins probability stuff that can tell you about the expected
distribution.

The expected number of occupied bins for 1 million balls into 1 million bins:

select 1000000.0 * (1.0 - pow(1.0 - (1.0 / 1000000), 1000000.0));
-> 632120.7427683549057142050000000

The same thing by counting distinct positive random numbers modulo 1 million:

select count(distinct (random() * 200000000)::int % 1000000) from
generate_series(1, 1000000) ss;
-> 632373, 632246, 631954, ...

Distinct hashes of the first 1 million integers (arbitrary keys)
modulo 1 million (buckets):

select count(distinct abs(hashoid(n)) % 1000000) from
generate_series(1, 1000000) ss(n);
-> 632115

(For a moment I was confused about getting a higher number until I
realised I needed abs() to avoid doubling the effective number of
buckets...)

> > Well, it may be o.k. I've just checked what the fill factor means in a hash
> > index, and it's also the number of entries divided by the number of buckets.
> >
>
> Using a load factor of ~0.75 is definitely the right thing to do. One way
> to interpret it is "average chain length" (or whatever the equivalent is
> in that particular hash table implementation), and one of the main points
> of hash tables is to eliminate linear scans. We could pick a value closer
> to 1.0, but experience says it's not worth it - it's easy to get
> annoyingly long chains due to hash collisions, in exchange for fairly
> minimal space savings.

Using the hashes of the first 1 million integers, let's try putting
them into different numbers of buckets, and see how many buckets we
get with each chain length:

WITH entries AS (SELECT abs(hashoid(generate_series(1, 1000000))) %
NBUCKETS AS bucket),
     lengths AS (SELECT bucket, COUNT(*) AS chain_length FROM entries
GROUP BY bucket)
SELECT chain_length, COUNT(*) AS count
  FROM lengths GROUP BY chain_length ORDER BY 2 DESC;

NBUCKETS = 1000000 (load factor 1.0)

 chain_length | count
--------------+--------
            1 | 367537
            2 | 184580
            3 |  61109
            4 |  15197
            5 |   3079
            6 |    509
            7 |     95
            8 |      7
            9 |      2

NBUCKETS = 1333333 (load factor 0.75)

 chain_length | count
--------------+--------
            1 | 472012
            2 | 177174
            3 |  44348
            4 |   8361
            5 |   1243
            6 |    143
            7 |      9
            8 |      2

NBUCKETS = 2000000 (load factor 0.5)

 chain_length | count
--------------+--------
            1 | 606451
            2 | 151741
            3 |  25226
            4 |   3148
            5 |    323
            6 |     28
            7 |      2

> That being said, I wonder if we should tweak NTUP_PER_BUCKET=1 in hash
> joins to a lower value. IIRC we ended up with 1.0 because originally it
> was set to 10.0, and we reduced it to 1.0 in 9.5, which gave us a nice
> speedup. But I don't recall if we tried an even lower value (probably
> not). Maybe we don't need to do that, because we only build the hash
> table at the very end, when we know exactly how many entries it will
> contain, so we don't need to do lookups and inserts at the same time,
> and we don't need to grow the hash table (at least in the non-parallel
> case). And we end up with a 0.75 load factor on average, due to the
> doubling (the load factors are essentially uniformly distributed between
> 0.5+epsilon and 1.0-epsilon).

Yeah, it would indeed be interesting to experiment with load factors
lower than 1.0.  Every link in the chain is a cache miss.  In future
we should work on mitigating that by prefetching during both building
and probing (see nearby thread), but I suspect you can only do that
effectively for the bucket header and perhaps the first tuple in the
chain; if I'm right then longer chains will become even more expensive
relative to short ones.

By the way, aside from wasting memory, when you add extra buckets you
make full/right outer hash joins slower because they scan all the
buckets.

-- 
Thomas Munro
http://www.enterprisedb.com