Re: Hash tables in dynamic shared memory - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Hash tables in dynamic shared memory
Date
Msg-id 20161004222209.tx3aiee2h4ai4lfv@alap3.anarazel.de
Whole thread Raw
In response to Hash tables in dynamic shared memory  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: Hash tables in dynamic shared memory  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Hi,

> It's impossible to write a general purpose hash table that will be
> suitable for every use case, considering all the combinations of
> design trade offs and subsets of functionality someone might want.

Very much agreed.


> There is other work being done in this space:  I'm aware of Andres's
> current thread[1] on Robin Hood-style closed hashing tables with
> macro-ised size, hash and comparator operations à la C++ container
> templates, and I'm following along with interest.

> He vigorously
> rejects chaining hash tables as the natural enemies of memory
> hardware.

I'd not go quite that far. There are good arguments for open addressing,
and there's good arguments for open chaining. The latter is a lot more
convincing if you want concurrent access with partition based locking,
for example...


> I guess Andres's simplehash should already work in DSA memory nicely
> anyway since it doesn't have any internal pointers IIUC (?).

The data access doesn't have pointers, but the "metadata" does have a
pointer to the data. But that's a very solvable problem.  But
incremental resizing and granular locking aren't going to happen for it,
so it'd really only be suitable for presenting a read-only (or possibly
read-mostly) view.


> Potential use cases for DHT include caches, in-memory database objects
> and working state for parallel execution.

Is there a more concrete example, i.e. a user we'd convert to this at
the same time as introducing this hashtable?


> There are a couple of things I'm not happy about in this version of
> DHT: the space overhead per item is too high (hash + wasted padding +
> next pointer), and the iterator system is overly complicated.  I have
> another version in development which gets rid of the hash and padding
> and sometimes gets rid of the next pointer to achieve zero overhead,
> and I'm working on a simpler iteration algorithm that keeps the
> current guarantees but doesn't need to reorder bucket chains.  More on
> that, with some notes on performance and a testing module, soon.

FWIW, my experimentation showed that getting rid of the hash itself is
quite often *NOT* a worthwhile tradeof. Comparing keys and recomputing
hashes is often quite expensive (e.g. for GROUP BY it's where the
majority of time is spent after the simplehash patch).

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Gilles Darold
Date:
Subject: Re: proposal: psql \setfileref
Next
From: Andres Freund
Date:
Subject: Idea: Tell compiler palloc() et al return aligned values