Thread: Why hash OIDs?

Why hash OIDs?

From: Thomas Munro
Hello,

I'm curious about something which may qualify as a stupid question.

What bad thing would happen if we used OIDs directly as hash values in
internal hash tables (that is, instead of uint32_hash() we'd use
uint32_identity(), or somehow optimise it away entirely, as you can
see in some C++ standard libraries for eg std::hash<int>)?  From what
I know of this subject, the main risk is that, in general, the
distribution of integer keys stored in a hash table might
*accidentally* happen to have regular gaps with a stride that shares a
common factor with the number of buckets leading to an unbalanced hash
table (memory addresses are an example because they tend to have a
stride corresponding to hardware alignment rules, as are some
distributed sequence generation tricks), whereas it probably takes a
non-accidental attack to get higher collision rates with a decent hash
function.  A good hash function would defend against that kind of
thing (but cost cycles to compute), and alternatively a prime number
of buckets would defend against it (but cost more cycles to %).
However, as far as I can see OIDs are expected to have an even
distribution (or at least we don't expect regular sized gaps), so the
hazard doesn't seem to apply.
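
To make the general hazard concrete (for keys that really do have such
a stride), here's a throwaway sketch -- plain C, made-up names, nothing
to do with our actual hash table code -- of an identity "hash" meeting
a power-of-two bucket mask when the keys have a stride of 8:

#include <stdio.h>

int
main(void)
{
    unsigned int nbuckets = 1024;       /* power of two, as our tables use */
    unsigned int used[1024] = {0};
    unsigned int key, i, nused = 0;

    /* 1000 keys with a stride of 8, e.g. pointers or generated IDs */
    for (i = 0, key = 16384; i < 1000; i++, key += 8)
        used[key & (nbuckets - 1)]++;   /* identity hash, mask the low bits */

    for (i = 0; i < nbuckets; i++)
        if (used[i])
            nused++;

    /* prints "128 of 1024 buckets used": 7/8 of the table sits empty */
    printf("%u of %u buckets used\n", nused, nbuckets);
    return 0;
}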

One problem could be that, although collisions are not expected with
high probability when the hash table is big enough, when they do
occur, hash tables using linear probing-based collision strategies
(simplehash, but not dynahash) probably work less well if the chance
of non-empty buckets having non-empty neighbours (AKA clustering)
increases.  I'm not sure whether to consider OIDs to be clustered in
general or not (though obviously the ones created by initdb are).
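
To spell out the clustering mechanism I mean, the probing scheme is
roughly this (a toy sketch only -- simplehash is considerably more
sophisticated):

#define NSLOTS 1024                     /* power of two */

typedef struct
{
    unsigned int key;
    int          used;
} Slot;

/* Toy linear-probing insert; assumes the table never fills up. */
static void
toy_insert(Slot *table, unsigned int key)
{
    unsigned int i = key & (NSLOTS - 1);    /* identity hash */

    while (table[i].used)               /* walk past occupied neighbours */
        i = (i + 1) & (NSLOTS - 1);
    table[i].key = key;
    table[i].used = 1;
}

With densely packed keys and an identity hash the occupied slots form
long runs, so a key that does collide has to walk the whole run; a
scrambling hash function breaks such runs up.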

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Why hash OIDs?

From: Andres Freund
On 2018-08-28 13:50:43 +1200, Thomas Munro wrote:
> I'm curious about something which may qualify as a stupid question.
> 
> What bad thing would happen if we used OIDs directly as hash values in
> internal hash tables (that is, instead of uint32_hash() we'd use
> uint32_identity(), or somehow optimise it away entirely, as you can
> see in some C++ standard libraries for eg std::hash<int>)?

Oids are very much not equally distributed, so in all likelihood cases
where you currently have a reasonably well averaged out usage of the
hashtable would not be that anymore.

It's also fairly cheap to hash an oid.
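
E.g. a murmur-style finalizer -- roughly the shape of thing we use for
a 32-bit value, going from memory -- is just three xor/shift steps and
two multiplies:

/* Murmur3-style finalizer for a 32-bit value; a handful of cycles. */
static inline unsigned int
cheap_hash_uint32(unsigned int h)
{
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}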


> However, as far as I can see OIDs are expected to have an even
> distribution (or at least we don't expect regular sized gaps), so the
> hazard doesn't seem to apply.

Huh? Oids between, say, 1 and FirstNormalObjectId, are vastly more
common than the rest. And even after that, individual tables get large
clusters of sequential values from the global oid counter.


Greetings,

Andres Freund


Re: Why hash OIDs?

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> On 2018-08-28 13:50:43 +1200, Thomas Munro wrote:
>> What bad thing would happen if we used OIDs directly as hash values in
>> internal hash tables (that is, instead of uint32_hash() we'd use
>> uint32_identity(), or somehow optimise it away entirely, as you can
>> see in some C++ standard libraries for eg std::hash<int>)?

> Oids are very much not equally distributed, so in all likelihood cases
> where you currently have a reasonably well averaged out usage of the
> hashtable would not be that anymore.

Right.  In particular, most of our hash usages assume that all bits of
the hash value are equally "random", so that we can just mask off the
lowest N bits of the hash and not get values that are biased towards
particular hash buckets.  It's unlikely that raw OIDs would have that
property.

            regards, tom lane


Re: Why hash OIDs?

From: Thomas Munro
On Tue, Aug 28, 2018 at 2:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-08-28 13:50:43 +1200, Thomas Munro wrote:
> >> What bad thing would happen if we used OIDs directly as hash values in
> >> internal hash tables (that is, instead of uint32_hash() we'd use
> >> uint32_identity(), or somehow optimise it away entirely, as you can
> >> see in some C++ standard libraries for eg std::hash<int>)?
>
> > Oids are very much not equally distributed, so in all likelihood cases
> > where you currently have a reasonably well averaged out usage of the
> > hashtable would not be that anymore.
>
> Right.  In particular, most of our hash usages assume that all bits of
> the hash value are equally "random", so that we can just mask off the
> lowest N bits of the hash and not get values that are biased towards
> particular hash buckets.  It's unlikely that raw OIDs would have that
> property.

Yeah, it would be a terrible idea as a general hash function for use
in contexts where the "avalanche effect" assumption is made about
information being spread out over the bits (the HJ batching code
wouldn't work for example).  I was wondering specifically about the
limited case of hash tables that are used to look things up in caches.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Why hash OIDs?

From: Andres Freund
On 2018-08-28 14:45:49 +1200, Thomas Munro wrote:
> On Tue, Aug 28, 2018 at 2:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2018-08-28 13:50:43 +1200, Thomas Munro wrote:
> > >> What bad thing would happen if we used OIDs directly as hash values in
> > >> internal hash tables (that is, instead of uint32_hash() we'd use
> > >> uint32_identity(), or somehow optimise it away entirely, as you can
> > >> see in some C++ standard libraries for eg std::hash<int>)?
> >
> > > Oids are very much not equally distributed, so in all likelihood cases
> > > where you currently have a reasonably well averaged out usage of the
> > > hashtable would not be that anymore.
> >
> > Right.  In particular, most of our hash usages assume that all bits of
> > the hash value are equally "random", so that we can just mask off the
> > lowest N bits of the hash and not get values that are biased towards
> > particular hash buckets.  It's unlikely that raw OIDs would have that
> > property.
> 
> Yeah, it would be a terrible idea as a general hash function for use
> in contexts where the "avalanche effect" assumption is made about
> information being spread out over the bits (the HJ batching code
> wouldn't work for example).  I was wondering specifically about the
> limited case of hash tables that are used to look things up in caches.

I don't understand why that'd be ok there? With a simple 1:1 hash
function, which you seem to advocate, many hash-tables would be much
fuller in the 1000-3000 range (where pg_class, etc all reside) than in any
other range.  A lot of our tables are populated on-demand, so you'd
often end up with most of the data in one or two buckets, and a larger
number of buckets largely empty.

Greetings,

Andres Freund


Re: Why hash OIDs?

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> On 2018-08-28 14:45:49 +1200, Thomas Munro wrote:
>> Yeah, it would be a terrible idea as a general hash function for use
>> in contexts where the "avalanche effect" assumption is made about
>> information being spread out over the bits (the HJ batching code
>> wouldn't work for example).  I was wondering specifically about the
>> limited case of hash tables that are used to look things up in caches.

> I don't understand why that'd be ok there? With a simple 1:1 hash
> function, which you seem to advocate, many hash-tables would be much
> fuller in the 1000-3000 range (where pg_class, etc all reside) than in any
> other range.  A lot of our tables are populated on-demand, so you'd
> often end up with most of the data in one or two buckets, and a larger
> number of buckets largely empty.

Hmm ... but what we actually care about, for dynahash tables, is just
the low order bits of the hash; the original range of the values
isn't interesting.  Some desultory experimentation with queries like

select oid::int % 1024, count(*) from pg_proc
group by 1 order by 2 desc;

select (hashoid(oid)+2^32)::int8 % 1024, count(*) from pg_proc
group by 1 order by 2 desc;

offers some support for Thomas' idea: the hashoid results are
certainly not better distributed than the raw OIDs, and in some
cases noticeably worse.

I'd supposed that we needed to get the OIDs' high-order bits to
participate in the hash in order to get decent results, but
looking at these numbers makes me wonder.

OTOH, I'm also not convinced it's worth messing with.  In principle
we could shave some cycles with a no-op hash function, but I've not
seen anything to make me think that hashoid() is a measurable
bottleneck.

            regards, tom lane


Re: Why hash OIDs?

From: Robert Haas
On Mon, Aug 27, 2018 at 10:12 PM, Andres Freund <andres@anarazel.de> wrote:
> Huh? Oids between, say, 1 and FirstNormalObjectId, are vastly more
> common than the rest. And even after that, individual tables get large
> clusters of sequential values from the global oid counter.

Sure, but if you get a large cluster of sequential values, a straight
mod-N bucket mapping works just fine.   I think the bigger problem is
that you might get a large cluster of values separated by exactly a
power of 2.  For instance, say you have one serial column and one
index:

rhaas=# create table a (x serial primary key);
CREATE TABLE
rhaas=# create table b (x serial primary key);
CREATE TABLE
rhaas=# select 'a'::regclass::oid, 'b'::regclass::oid;
  oid  |  oid
-------+-------
 16422 | 16430
(1 row)

If you have a lot of tables like that, bad things are going to happen
to your hash table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Why hash OIDs?

From: Thomas Munro
On Wed, Aug 29, 2018 at 1:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-08-28 14:45:49 +1200, Thomas Munro wrote:
> >> Yeah, it would be a terrible idea as a general hash function for use
> >> in contexts where the "avalanche effect" assumption is made about
> >> information being spread out over the bits (the HJ batching code
> >> wouldn't work for example).  I was wondering specifically about the
> >> limited case of hash tables that are used to look things up in caches.
>
> > I don't understand why that'd be ok there? With a simple 1:1 hash
> > function, which you seem to advocate, many hash-tables would be much
> > fuller in the 1000-3000 range (where pg_class, etc all reside) than in any
> > other range.  A lot of our tables are populated on-demand, so you'd
> > often end up with most of the data in one or two buckets, and a larger
> > number of buckets largely empty.
>
> Hmm ... but what we actually care about, for dynahash tables, is just
> the low order bits of the hash; the original range of the values
> isn't interesting.  Some desultory experimentation with queries like
>
> select oid::int % 1024, count(*) from pg_proc
> group by 1 order by 2 desc;
>
> select (hashoid(oid)+2^32)::int8 % 1024, count(*) from pg_proc
> group by 1 order by 2 desc;
>
> offers some support for Thomas' idea: the hashoid results are
> certainly not better distributed than the raw OIDs, and in some
> cases noticeably worse.

Yeah, I noticed that too.  In that example we'd presumably be using
4096 buckets for 2128 procedures, but the results are similar.  Of
course a bad example can also be contrived (as it can for any
hash function, with varying amounts of work).  The question is just
whether such a situation is likely to be encountered unintentionally
with trivial identity hashes.  An argument against this idea is that
it's one of those situations where even though you can say "oh, that's
really unlikely to happen", if it does happen in your schema for
whatever reason it's persistent (cf https://xkcd.com/221/).

> I'd supposed that we needed to get the OIDs' high-order bits to
> participate in the hash in order to get decent results, but
> looking at these numbers makes me wonder.
>
> OTOH, I'm also not convinced it's worth messing with.  In principle
> we could shave some cycles with a no-op hash function, but I've not
> seen anything to make me think that hashoid() is a measurable
> bottleneck.

I do think there is probably a little bit more performance to squeeze
out of syscache though.  For example, I ran Andres's
pgbench-many-cols.sql[1] and saw a ~3% TPS increase just by sticking
pg_attribute_always_inline in front of SearchCatCacheInternal()
(single threaded pgbench, clang, -O2).  That may be in the range of
random performance changes caused by code movement, but the logical
extreme of inlining/specialisation seems like it should be a win: the
fast path of a syscache lookup could be inlined by the caller,
including the hash and eq functions (rather than going through
function pointers).  If we had identity OID hashing, I suppose you'd
be left with just a few instructions: mask the OID, index into the
bucket array, compare pointer, unlikely jump, compare the OID,
unlikely jump.  That said, a few extra instructions for murmur hash
hoop jumping might not make much difference compared to the benefits
of inlining/specialisation.
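
In pseudo-C, the sort of specialised fast path I'm imagining looks
something like this (entirely made-up structure and function names, not
the real catcache):

typedef struct OidCacheEntry
{
    unsigned int           key;         /* the OID */
    void                  *tuple;       /* cached catalogue tuple */
    struct OidCacheEntry  *next;        /* collision chain */
} OidCacheEntry;

typedef struct OidCache
{
    unsigned int    nbuckets;           /* power of two */
    OidCacheEntry **buckets;
} OidCache;

extern void *oidcache_lookup_slow(OidCache *cache, unsigned int oid);

/* Hypothetical inlined fast path with identity OID hashing: mask,
 * index, pointer test, key test, else fall back to the slow path. */
static inline void *
oidcache_lookup(OidCache *cache, unsigned int oid)
{
    OidCacheEntry *e = cache->buckets[oid & (cache->nbuckets - 1)];

    if (e != NULL && e->key == oid)
        return e->tuple;
    return oidcache_lookup_slow(cache, oid);    /* chains, misses, loads */
}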

My real interest here is something similar to OIDs but not the same:
numbered undo logs.  I'll have more to say about that soon, but
basically I went from looking things up by index in arrays in DSM
segments in my previous version to a new system that doesn't require
DSM segments anymore but does require a hash table.  I found myself
wondering why I need to even use a hash function for my integer keys,
which seem to be pretty close to 'perfect' hashes and the generated
code is pleasingly close to the array index lookup I had before.
Apparently it's good enough for the designers of C++ standard
libraries, the Java standard library, etc etc to use trivial identity
hashing for integers, while not knowing anything at all about key
distribution.  I suspect it probably works out pretty well in a lot of
cases as long as you don't have a sequence generator that makes gaps
with big power of two strides, don't play tricks that read arbitrary
subsets of the bits, and use a strong hash_combine for compound keys.
OIDs seem to come close...
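
By a strong hash_combine I mean something along the lines of the
boost-style mixer -- a sketch, assuming 32-bit pieces:

/* Fold a new 32-bit value into an accumulated hash.  If the individual
 * integer fields use identity hashing, the quality of the compound
 * hash rests entirely on this mixing step. */
static inline unsigned int
combine_hash(unsigned int a, unsigned int b)
{
    a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
    return a;
}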

[1] https://www.postgresql.org/message-id/20170922061545.wkbzt7b6p6x47bzg%40alap3.anarazel.de

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Why hash OIDs?

From: Thomas Munro
On Wed, Aug 29, 2018 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Aug 27, 2018 at 10:12 PM, Andres Freund <andres@anarazel.de> wrote:
> > Huh? Oids between, say, 1 and FirstNormalObjectId, are vastly more
> > common than the rest. And even after that, individual tables get large
> > clusters of sequential values from the global oid counter.
>
> Sure, but if you get a large cluster of sequential values, a straight
> mod-N bucket mapping works just fine.   I think the bigger problem is
> that you might get a large cluster of values separated by exactly a
> power of 2.  For instance, say you have one serial column and one
> index:
>
> rhaas=# create table a (x serial primary key);
> CREATE TABLE
> rhaas=# create table b (x serial primary key);
> CREATE TABLE
> rhaas=# select 'a'::regclass::oid, 'b'::regclass::oid;
>   oid  |  oid
> -------+-------
>  16422 | 16430
> (1 row)
>
> If you have a lot of tables like that, bad things are going to happen
> to your hash table.

Right.  I suppose that might happen accidentally when creating a lot
of partitions.

Advance the OID generator by some prime number after every CREATE TABLE?

/me ducks

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Why hash OIDs?

From: Thomas Munro
On Wed, Aug 29, 2018 at 11:05 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Aug 29, 2018 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > rhaas=# create table a (x serial primary key);
> > CREATE TABLE
> > rhaas=# create table b (x serial primary key);
> > CREATE TABLE
> > rhaas=# select 'a'::regclass::oid, 'b'::regclass::oid;
> >   oid  |  oid
> > -------+-------
> >  16422 | 16430
> > (1 row)
> >
> > If you have a lot of tables like that, bad things are going to happen
> > to your hash table.
>
> Right.  I suppose that might happen accidentally when creating a lot
> of partitions.
>
> Advance the OID generator by some prime number after every CREATE TABLE?
>
> /me ducks

Erm, s/prime/random/.   Or use a different OID generator for each
catalogue so that attributes etc don't create gaps in pg_class OIDs.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Why hash OIDs?

From: Tom Lane
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Wed, Aug 29, 2018 at 11:05 AM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Aug 29, 2018 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
>>> rhaas=# create table a (x serial primary key);
>>> CREATE TABLE
>>> rhaas=# create table b (x serial primary key);
>>> CREATE TABLE
>>> rhaas=# select 'a'::regclass::oid, 'b'::regclass::oid;
>>> oid  |  oid
>>> -------+-------
>>> 16422 | 16430
>>> (1 row)
>>> If you have a lot of tables like that, bad things are going to happen
>>> to your hash table.

>> Right.  I suppose that might happen accidentally when creating a lot
>> of partitions.
>> Advance the OID generator by some prime number after every CREATE TABLE?

> Erm, s/prime/random/.   Or use a different OID generator for each
> catalogue so that attributes etc don't create gaps in pg_class OIDs.

I think this argument is a red herring TBH.  The example Robert shows is
of *zero* interest for dynahash or catcache, unless it's taking only the
low order 3 bits of the OID for the bucket number.  But actually we'll
increase the table size proportionally to the number of entries, so
that you can't have say 1000 table entries without at least 10 bits
being used for the bucket number.  That means that you'd only have
trouble if those 1000 tables all had OIDs exactly 1K (or some multiple
of that) apart.  Such a case sounds quite contrived from here.

            regards, tom lane


Re: Why hash OIDs?

From: Robert Haas
On Tue, Aug 28, 2018 at 8:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think this argument is a red herring TBH.  The example Robert shows is
> of *zero* interest for dynahash or catcache, unless it's taking only the
> low order 3 bits of the OID for the bucket number.  But actually we'll
> increase the table size proportionally to the number of entries, so
> that you can't have say 1000 table entries without at least 10 bits
> being used for the bucket number.  That means that you'd only have
> trouble if those 1000 tables all had OIDs exactly 1K (or some multiple
> of that) apart.  Such a case sounds quite contrived from here.

Hmm.  I was thinking that it was a problem if the number of OIDs
consumed per table was a FACTOR of 1000, not just if it was a MULTIPLE
of 1000.  I mean, if it's, say, 4, that means three-quarters of your hash
table buckets are unused, which seems poor.  But maybe it's not really
a big enough problem in practice for us to care?  Dunno.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company