Re: Why hash OIDs? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Why hash OIDs?
Date
Msg-id CAEepm=13rZOTkrgLzJOQrg=YMidHKNS6ovDvgZ=aG47-PjBSdA@mail.gmail.com
In response to Re: Why hash OIDs?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Aug 29, 2018 at 1:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-08-28 14:45:49 +1200, Thomas Munro wrote:
> >> Yeah, it would be a terrible idea as a general hash function for use
> >> in contexts where the "avalanche effect" assumption is made about
> >> information being spread out over the bits (the HJ batching code
> >> wouldn't work for example).  I was wondering specifically about the
> >> limited case of hash tables that are used to look things up in caches.
>
> > I don't understand why that'd be ok there? With a simple 1:1 hash
> > function, which you seem to advocate, many hash-tables would be much
> > fuller in the 1000-3000 (where pg_class, etc all reside) than in any
> > other range.  A lot of our tables are populated on-demand, so you'd
> > often end up with most of the data in one or two buckets, and a larger
> > number largely empty.
>
> Hmm ... but what we actually care about, for dynahash tables, is just
> the low order bits of the hash; the original range of the values
> isn't interesting.  Some desultory experimentation with queries like
>
> select oid::int % 1024, count(*) from pg_proc
> group by 1 order by 2 desc;
>
> select (hashoid(oid)+2^32)::int8 % 1024, count(*) from pg_proc
> group by 1 order by 2 desc;
>
> offers some support for Thomas' idea: the hashoid results are
> certainly not better distributed than the raw OIDs, and in some
> cases noticeably worse.

Yeah, I noticed that too.  In that example we'd presumably be using
4096 buckets for 2128 procedures, but the results are similar.  Of
course a bad example can also be contrived (as it can for any
hash function, with varying amounts of work).  The question is just
whether such a situation is likely to be encountered unintentionally
with trivial identity hashes.  An argument against this idea is that
it's one of those situations where even though you can say "oh, that's
really unlikely to happen", if it does happen in your schema for
whatever reason it's persistent (cf https://xkcd.com/221/).

> I'd supposed that we needed to get the OIDs' high-order bits to
> participate in the hash in order to get decent results, but
> looking at these numbers makes me wonder.
>
> OTOH, I'm also not convinced it's worth messing with.  In principle
> we could shave some cycles with a no-op hash function, but I've not
> seen anything to make me think that hashoid() is a measurable
> bottleneck.

I do think there is probably a little bit more performance to squeeze
out of syscache though.  For example, I ran Andres's
pgbench-many-cols.sql[1] and saw a ~3% TPS increase just by sticking
pg_attribute_always_inline in front of SearchCatCacheInternal()
(single threaded pgbench, clang, -O2).  That may be in the range of
random performance changes caused by code movement, but the logical
extreme of inlining/specialisation seems like it should be a win: the
fast path of a syscache lookup could be inlined by the caller,
including the hash and eq functions (rather than going through
function pointers).  If we had identity OID hashing, I suppose you'd
be left with just a few instructions: mask the OID, index into the
bucket array, compare pointer, unlikely jump, compare the OID,
unlikely jump.  That said, a few extra instructions for murmur hash
hoop jumping might not make much difference compared to the benefits
of inlining/specialisation.

My real interest here is something similar to OIDs but not the same:
numbered undo logs.  I'll have more to say about that soon, but
basically I went from looking things up by index in arrays in DSM
segments in my previous version to a new system that doesn't require
DSM segments anymore but does require a hash table.  I found myself
wondering why I need to even use a hash function for my integer keys,
which seem to be pretty close to 'perfect' hashes, and the generated
code is pleasingly close to the array index lookup I had before.
Apparently it's good enough for the designers of C++ standard
libraries, the Java standard library, etc etc to use trivial identity
hashing for integers, while not knowing anything at all about key
distribution.  I suspect it probably works out pretty well in a lot of
cases as long as you don't have a sequence generator that makes gaps
with big power of two strides, don't play tricks that read arbitrary
subsets of the bits, and use a strong hash_combine for compound keys.
OIDs seem to come close...

[1] https://www.postgresql.org/message-id/20170922061545.wkbzt7b6p6x47bzg%40alap3.anarazel.de

-- 
Thomas Munro
http://www.enterprisedb.com

