Re: Horribly slow hash join - Mailing list pgsql-performance

From Greg Stark
Subject Re: Horribly slow hash join
Date
Msg-id 87llk7yh36.fsf@stark.xeocode.com
Whole thread Raw
In response to Re: Horribly slow hash join  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
> and is likely faster than any multiple-instruction way to do the same.
>
> The quoted article seems to be by someone who has spent a lot of time
> counting assembly cycles and none at all reading the last thirty years
> worth of CS literature.  Knuth's treatment of hashing has some actual
> math to it...

[incidentally, I just found that the quoted article was independently found by
Bruce Momjian who found it convincing enough to convert the hash_any table
over to it two years ago]

I just reviewed Knuth as well as C.L.R. and several papers from CACM and
SIGMOD.

It seems we have three options:

mod():

 Pro: empirical research shows it the best algorithm for avoiding collisions

 Con: requires the hash table be a prime size and far from a power of two.
 This is inconvenient to arrange for dynamic tables as used in postgres.

multiplication method (floor(tablesize * remainder(x * golden ratio)))

 Pro: works with any table size

 Con: I'm not clear if it can be done without floating point arithmetic.
      It seems like it ought to be possible though.

Universal hashing:

 Pro: won't create gotcha situations where the distribution of data suddenly
      and unexpectedly creates unexpected performance problems.

 Con: more complex.

It would be trivial to switch the implementation from mod() to the
multiplicative method which is more suited to postgres's needs. However
universal hashing has some appeal. I hate the idea that a hash join could
perform perfectly well one day and suddenly become pessimal when new data is
loaded.

In a sense universal hashing is less predictable. For a DSS system that could
be bad. A query that runs fine every day might fail one day in an
unpredictable way even though the data is unchanged.

But in another sense it would be more predictable in that if you run the query
a thousand times the average performance would be very close to the expected
regardless of what the data is. Whereas more traditional algorithms have some
patterns of data that will consistently perform badly.

It's probably not worth it but postgres could maybe even be tricky and pretest
the parameters against the common values in the statistics table, generating
new ones if they fail to generate a nice distribution. That doesn't really
guarantee anything though, except that those common values would at least be
well distributed to start with.

--
greg

pgsql-performance by date:

Previous
From: James Thornton
Date:
Subject: Re: Recommended File System Configuration
Next
From: "Cyrille Bonnet"
Date:
Subject: very high CPU usage in "top", but not in "mpstat"