On Tue, Jun 22, 2021 at 1:55 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 22 Jun 2021 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I kind of wonder if we really need four different hash table
> > implementations (this being the third "generic" one, plus hash join
> > has its own, and I may have forgotten others). Should we instead
> > think about revising simplehash to gain the benefits of this patch?
>
> hmm, yeah. I definitely agree with trying to have as much reusable
> code as we can when we can. It certainly reduces maintenance and bugs
> tend to be found more quickly too. It's a very worthy cause.
It is indeed really hard to decide when you have a new thing, and when
you need a new way to parameterise the existing generic thing. I
wondered about this how-many-hash-tables-does-it-take question a lot
when writing dshash.c (a chaining hash table that can live in weird
dsa.c memory, backed by DSM segments created on the fly that may be
mapped at different addresses in each backend, and has dynahash-style
partition locking), and this was around the time Andres was talking
about simplehash. In retrospect, I'd probably kick out the built-in
locking and partitions and instead let callers create their own
partitioning scheme on top from N tables, and that'd remove one quirk,
leaving only the freaky pointers and allocator. I recall from a
previous life that Boost's unordered_map template is smart enough to
support running in shared memory mapped at different addresses just
through parameterisation that controls the way it deals with internal
pointers (boost::unordered_map<..., ShmemAllocator>), which seemed
pretty clever to me, and it might be achievable to do the same with a
generic hash table for us that could take over dshash's specialness.
One idea I had at the time is that the right number of hash table
implementations in our tree is two: one for chaining (like dynahash)
and one for open addressing/probing (like simplehash), and that
everything else should be hoisted out (locking, partitioning) or made
into template parameters through the generic programming technique
that simplehash.h has demonstrated (allocators, magic pointer type for
internal pointers, plus of course the inlinable ops). But that was
before we'd really fully adopted the idea of this style of template
code. (I also assumed the weird memory stuff would be temporary and
we'd move to threads, but that's another topic for another thread.)
It seems like you'd disagree with this, and you'd say the right number
is three. But it's also possible to argue for one...
A more superficial comment: I don't like calling hash tables "hash".
I blame perl.