Hello,
I'm curious about something which may qualify as a stupid question.
What bad thing would happen if we used OIDs directly as hash values in
internal hash tables (that is, instead of uint32_hash() we'd use
uint32_identity(), or somehow optimise it away entirely, as you can
see in some C++ standard libraries for eg std::hash<int>)? From what
I know of this subject, the main risk is that, in general, the
distribution of integer keys stored in a hash table might
*accidentally* happen to have regular gaps with a stride that shares a
common factor with the number of buckets leading to an unbalanced hash
table (memory addresses are an example because they tend to have a
stride corresponding to hardware alignment rules, as are some
distributed sequence generation tricks), whereas it probably takes a
non-accidental attack to get higher collision rates with a decent hash
function. A good hash function would defend against that kind of
thing (but cost cycles to compute), and alternatively a prime number
of buckets would defend against it (but cost more cycles to %).
However, as far as I can see OIDs are expected to have an even
distribution (or at least we don't expect regular sized gaps), so the
hazard doesn't seem to apply.
One problem could be that, although collisions are not expected with
high probability when the hash table is big enough, when they do
occur, hash tables using linear probing-based collision strategies
(simplehash, but not dynahash) probably work less well if the chance
of non-empty buckets having non-empty neighbours (AKA clustering)
increases. I'm not sure whether to consider OIDs to be clustered in
general or not (though obviously the ones created by initdb are).
--
Thomas Munro
http://www.enterprisedb.com