From: Thomas Munro
Subject: Re: Comment fix and question about dshash.c
Date:
Msg-id: CAEepm=0RMoHPRovc5ugcPmKmaNRFUeZ+3bCf8uXbJ6Fx-1jCuQ@mail.gmail.com
In response to: Comment fix and question about dshash.c (Antonin Houska <ah@cybertec.at>)
Responses: Re: Comment fix and question about dshash.c (Thomas Munro <thomas.munro@enterprisedb.com>)
           Re: Comment fix and question about dshash.c (Antonin Houska <ah@cybertec.at>)
List: pgsql-hackers
On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote:
> 1. The return type of resize() function is void, so I propose part of the
> header comment to be removed:
>
> diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
> index b46f7c4cfd..b2b8fe60e1 100644
> --- a/src/backend/lib/dshash.c
> +++ b/src/backend/lib/dshash.c
> @@ -672,9 +672,7 @@ delete_item(dshash_table *hash_table, dshash_table_item *item)
>
>  /*
>   * Grow the hash table if necessary to the requested number of buckets.  The
> - * requested size must be double some previously observed size.  Returns true
> - * if the table was successfully expanded or found to be big enough already
> - * (because another backend expanded it).
> + * requested size must be double some previously observed size.
>   *
>   * Must be called without any partition lock held.
>   */

Thanks, will fix.

> 2. Can anyone please explain this macro?
>
> /* Max entries before we need to grow.  Half + quarter = 75% load factor. */
> #define MAX_COUNT_PER_PARTITION(hash_table)                             \
>         (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \
>          BUCKETS_PER_PARTITION(hash_table->size_log2) / 4)
>
> I'm failing to understand why the maximum number of hash table entries in a
> partition should be smaller than the number of buckets in that partition.
>
> The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is
> obvious from this condition in dshash_find_or_insert()
>
>         /* Check if we are getting too full. */
>         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))

Are you saying there is a bug in this logic (which is nbuckets * 0.75
expressed in integer maths), or that 0.75 is not a good maximum load
factor?  I looked around at a couple of general-purpose hash tables
and saw that some used 0.75 and some used 1.0, as a trade-off between
wasted space and collisions.  If I have my maths right[1], with a load
factor of 0.75 you expect 75 entries to land in ~53 of 100 buckets,
while with 1.0 you expect 100 entries to land in ~64 of 100 buckets.
It'd be a fair criticism that this is arbitrarily different from the
choice made for hash joins, and that dynahash's default is 1 (though
there it's a run-time parameter).
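
To spell out the integer maths: BUCKETS_PER_PARTITION() yields a power
of two, so n / 2 + n / 4 is exactly 3n / 4, with no rounding surprises.
A standalone sketch (not PostgreSQL code) that checks this:

#include <assert.h>
#include <stdio.h>

int
main(void)
{
	for (int sz_log2 = 2; sz_log2 <= 20; sz_log2++)
	{
		unsigned	n = 1u << sz_log2;	/* buckets per partition */
		unsigned	max_count = n / 2 + n / 4;	/* 75% threshold */

		assert(max_count == 3 * n / 4);
		printf("buckets=%u, max entries before grow=%u\n", n, max_count);
	}
	return 0;
}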

[1] https://math.stackexchange.com/questions/470662/average-number-of-bins-occupied-when-throwing-n-balls-into-n-bins
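
To make that concrete: the expected number of occupied bins after
throwing m balls into n bins is n * (1 - (1 - 1/n)^m).  A quick
back-of-the-envelope check (my own arithmetic, using the formula
from [1]):

#include <math.h>
#include <stdio.h>

static double
expected_occupied(double m, double n)
{
	/* n * (1 - (1 - 1/n)^m): expected occupied bins, m balls, n bins */
	return n * (1.0 - pow(1.0 - 1.0 / n, m));
}

int
main(void)
{
	printf("load 0.75: %.1f of 100 buckets\n", expected_occupied(75, 100));
	printf("load 1.00: %.1f of 100 buckets\n", expected_occupied(100, 100));
	return 0;
}

should print roughly 52.9 and 63.4, i.e. the ~53 and ~64 above.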

-- 
Thomas Munro
http://www.enterprisedb.com

