Re: Division in dynahash.c due to HASH_FFACTOR - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: Division in dynahash.c due to HASH_FFACTOR
Date:
Msg-id: CA+hUKGKOF6rNH7kBdTzF65EhdyJ_2ibNe_C-Yfy_f6E6MkhaoQ@mail.gmail.com
In response to: Re: Division in dynahash.c due to HASH_FFACTOR (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses: Re: Division in dynahash.c due to HASH_FFACTOR (Jakub Wartak <Jakub.Wartak@tomtom.com>)
List: pgsql-hackers
On Sat, Sep 5, 2020 at 2:34 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Sep-04, Jakub Wartak wrote:
> > After removing HASH_FFACTOR PostgreSQL still compiles...  Would
> > removing it break some external API/extensions ?
>
> FWIW, HASH_FFACTOR has *never* been used in Postgres core code.
>
> https://postgr.es/m/20160418180711.55ac82c0@fujitsu already reported
> that this flag was unused a few years ago.  And a search through
> codesearch.debian.net shows that no software packaged by Debian uses
> that flag either.
>
> *If* any third-party extension is using HASH_FFACTOR, it can easily be
> unbroken by removing the flag or #defining it to zero; the removal would
> only be a problem if anyone showed that there is a performance loss by
> using the default fillfactor for some dynahash table instead of their
> custom value.
>
> I think if we know that there's a 4% performance increase by removing
> the division that comes with an ability to use a custom fillfactor that
> nobody uses or has ever used, it makes sense to remove that ability.

I think Tomas is right, and the correct fix for this would be to
compute the grow threshold up front every time the hash table size
changes, but since apparently no one ever used the custom fill
factor, +1 for just removing it and leaving it hard-coded.
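
Roughly what I mean, as a sketch only (hypothetical struct and
function names, not the real dynahash ones): keep a precomputed
threshold in the header, refresh it whenever the bucket count
changes, and then the insert path needs only a comparison instead of
an integer division.

#include <stdbool.h>

/* Hypothetical header layout, just to illustrate the idea. */
typedef struct HashHeader
{
    long        nentries;       /* entries currently in the table */
    long        nbuckets;       /* current number of buckets */
    long        ffactor;        /* target fill factor */
    long        grow_threshold; /* precomputed nbuckets * ffactor */
} HashHeader;

/* Call at creation time and whenever the bucket count changes. */
static void
recompute_grow_threshold(HashHeader *hctl)
{
    hctl->grow_threshold = hctl->nbuckets * hctl->ffactor;
}

/* Insert path: a plain comparison, no IDIV. */
static bool
needs_expansion(const HashHeader *hctl)
{
    return hctl->nentries >= hctl->grow_threshold;
}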

For what it's worth, for dshash.c I just hard-coded it to double the
hash table size whenever the number of elements in a partition
exceeded 75% of that partition's bucket count.  Popular hash tables
in standard libraries seem to use either .75 or 1.0 as a default or
only setting.  There's probably room for discussion about numbers in
that range, but I think the concept of customisable load factors with
a range much wider than that may be more relevant for disk-based hash
tables (like our hash indexes), which have very different economics.
I think the name HASH_FFACTOR may be a clue that this was carried
over from the disk-based hash tables written by the same Berkeley
people: rather than "load factor", the more usual term (?) for
nentries/nbuckets in memory-based hash tables as a way to model the
expected number of collisions, they called it "fill factor".  I guess
that's because in a disk-based design you actually want a certain
rate of collisions: you're not working with elements in an array,
where you want to avoid collisions without wasting space, but with
fixed-size buckets that you want to fill up without overflowing.
</archeological-speculation-mode>
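
As an aside, with the factor hard-coded to .75 and a power-of-two
bucket count, the grow check needs neither a division nor a
multiplication; something along these lines (a sketch, not dshash.c's
actual code):

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch: grow when entries exceed 3/4 of the buckets.  With nbuckets
 * a power of two, nbuckets - (nbuckets >> 2) is exactly 0.75 *
 * nbuckets, so the hot path is a shift, a subtraction and a
 * comparison.
 */
static bool
should_grow(size_t nentries, size_t nbuckets)
{
    return nentries > nbuckets - (nbuckets >> 2);
}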

> ... however, if we're really tense about improving some hash table's
> performance, it might be worth trying to use simplehash.h instead of
> dynahash.c for it.

Sure, but removing the dynahash IDIV also seems like a slam-dunk
removal of a useless, expensive feature.


