Re: [HACKERS] Performance degradation in TPC-H Q18 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Performance degradation in TPC-H Q18
Date
Msg-id 20170301040222.3hp5s7rvbuow2mtr@alap3.anarazel.de
Whole thread Raw
In response to Re: [HACKERS] Performance degradation in TPC-H Q18  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2017-03-01 09:21:47 +0530, Robert Haas wrote:
> On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:
> >> BTW, another option to consider is lowering the target fillfactor.
> >> IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
> >> issue.  Obviously, it's better to detect when we need a lower
> >> fillfactor than to always use a lower one, but obviously the tighter
> >> you pack the hash table, the more likely it is that you're going to
> >> have these kinds of problems.
> >
> > Yea, that'd be one approach, but I feel better dynamically increasing
> > the fillfactor like in the patch. Even with a lower fillfactor you could
> > see issues, and as you say a higher fillfactor is nice [TM]; after the
> > patch I played with *increasing* the fillfactor, without being able to
> > measure a downside.
> 
> I am going to bet that increasing the fillfactor would be a mistake.
> The expected length of a clump in the table is going to be about
> 1/(1-ff), which increases very quickly beyond the current value of
> 0.8.  For example, if the actual fillfactor is 0.9 and the keys are
> perfectly distributed -- nine in a row and then one empty slot,
> lather, rinse, repeat -- probing for a key that's not there has a 10%
> chance of hitting an empty slot, a 10% chance of hitting the slot just
> before an empty slot, and so on.  So the expected number of probes to
> find that there's no match is 4.5.  However, it could easily happen
> that you have 3 or 4 empty slots in a row in one place and 27 or 36
> occupied slots in a row in another place, and that could cause all
> kinds of grief.  Moreover, real-world hash functions on real-world
> data aren't always perfect, which increases the chances of things
> going off track.  I'm not exactly sure how high a fillfactor is too
> high, but note that
> https://en.wikipedia.org/wiki/Hash_table#Open_addressing claims that
> "performance dramatically degrades when the load factor grows beyond
> 0.7 or so", which isn't cited but the same text appears in
> https://www.unf.edu/~wkloster/3540/wiki_book2.pdf for whatever that's
> worth.

That's without the "trick" of "robin hood" hashing though:
https://en.wikipedia.org/wiki/Hash_table#Robin_Hood_hashing
http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/
https://www.pvk.ca/Blog/2013/11/26/the-other-robin-hood-hashing/

it probably depends a bit on the scenario how high you want to have the
fillfactor - for hashaggs a high fill factor would probably be good
(they're often "read" heavy), for bitmapscans less so.

But I guess there's not much point in immediately increasing the
fillfactor, can do that in a second step.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: [HACKERS] perlcritic
Next
From: Corey Huinker
Date:
Subject: Re: [HACKERS] \if, \elseif, \else, \endif (was Re: PSQL commands:\quit_if, \quit_unless)