Re: Division in dynahash.c due to HASH_FFACTOR - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: Division in dynahash.c due to HASH_FFACTOR
Msg-id: CA+hUKGKOF6rNH7kBdTzF65EhdyJ_2ibNe_C-Yfy_f6E6MkhaoQ@mail.gmail.com
In response to: Re: Division in dynahash.c due to HASH_FFACTOR (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses: Re: Division in dynahash.c due to HASH_FFACTOR (Jakub Wartak <Jakub.Wartak@tomtom.com>)
List: pgsql-hackers
On Sat, Sep 5, 2020 at 2:34 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Sep-04, Jakub Wartak wrote:
> > After removing HASH_FFACTOR PostgreSQL still compiles... Would
> > removing it break some external API/extensions ?
>
> FWIW, HASH_FFACTOR has *never* been used in Postgres core code.
>
> https://postgr.es/m/20160418180711.55ac82c0@fujitsu already reported
> that this flag was unused a few years ago. And a search through
> codesearch.debian.net shows that no software packaged by Debian uses
> that flag either.
>
> *If* any third-party extension is using HASH_FFACTOR, it can easily be
> unbroken by removing the flag or #defining it to zero; the removal would
> only be a problem if anyone showed that there is a performance loss by
> using the default fillfactor for some dynahash table instead of their
> custom value.
>
> I think if we know that there's a 4% performance increase by removing
> the division that comes with an ability to use a custom fillfactor that
> nobody uses or has ever used, it makes sense to remove that ability.

I think Tomas is right, and the correct fix for this would be to compute it
up front every time the hash table size changes, but since apparently no one
ever used it, +1 for just removing it and leaving it hard-coded.

For what it's worth, for dshash.c I just hard-coded it to double the hash
table size whenever the number of elements in a partition exceeded 75%.
Popular hash tables in standard libraries seem to use either 0.75 or 1.0 as
a default or only setting. There's probably room for discussion about
numbers in those ranges, but I think the concept of customisable load
factors with a range much wider than that may be more relevant for
disk-based hash tables (like our hash indexes), which have very different
economics. I think the name HASH_FFACTOR may be a clue that this was
possibly interference from disk-based hash tables built by the same Berkeley
people: rather than "load factor", the more usual term (?) for
nentries/nbuckets in memory-based hash tables as a way to model the expected
number of collisions, they called it "fill factor". I guess that's because
in a disk-based design you actually want a certain rate of collisions:
you're working not with elements in an array, where you want to avoid
collisions while not wasting space, but with fixed-size buckets that you
want to fill up, but not overflow. </archeological-speculation-mode>

> ... however, if we're really tense about improving some hash table's
> performance, it might be worth trying to use simplehash.h instead of
> dynahash.c for it.

Sure, but removing the dynahash IDIV also seems like a slam-dunk removal of
a useless expensive feature.
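To make the "compute it up front" idea concrete, here is a minimal sketch of the pattern under discussion: instead of evaluating nentries/nbuckets >= ffactor with an integer division on every insert, precompute the growth threshold once per resize and compare against it. The struct and function names here are illustrative, not dynahash's actual fields, and the fill factor is hard-coded to 1, matching the default being proposed.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative miniature, not dynahash's real layout. */
typedef struct MiniHash
{
    size_t  nbuckets;       /* current number of buckets */
    size_t  nentries;       /* current number of entries */
    size_t  grow_threshold; /* nbuckets * fill factor, precomputed */
} MiniHash;

#define MINI_FFACTOR 1      /* grow when nentries exceed nbuckets * 1 */

static void
mini_set_threshold(MiniHash *ht)
{
    /* recomputed only when the table size changes */
    ht->grow_threshold = ht->nbuckets * MINI_FFACTOR;
}

static void
mini_init(MiniHash *ht, size_t nbuckets)
{
    ht->nbuckets = nbuckets;
    ht->nentries = 0;
    mini_set_threshold(ht);
}

static void
mini_count_insert(MiniHash *ht)
{
    ht->nentries++;
    /* cheap comparison on the hot path replaces the per-insert IDIV */
    if (ht->nentries > ht->grow_threshold)
    {
        ht->nbuckets *= 2;      /* double, as dshash.c does */
        mini_set_threshold(ht); /* amortized: once per resize */
    }
}
```

Starting from 4 buckets, the fifth insert crosses the threshold and doubles the table to 8 buckets, with the division-free check paid on every insert and the multiplication only on resize.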