Thread: Hash tables in dynamic shared memory

Hash tables in dynamic shared memory

From
Thomas Munro
Date:
Hi hackers,

Here is an experimental hash table implementation that uses DSA memory
so that hash tables can be shared by multiple backends and yet grow
dynamically.  Development name:  "DHT".

It's impossible to write a general purpose hash table that will be
suitable for every use case, considering all the combinations of
design trade-offs and subsets of functionality someone might want.
What I'm trying to do here is produce something that is generally
useful for many cases and easy to use, along the lines of dynahash,
but use dynamic shared memory.  Here is the bullet point summary:

 * DSA memory-backed
 * open hashing (separate chaining)
 * incremental resizing
 * built-in partition-based locking
 * concurrent iteration
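
To give a feel for the shape of it, here's roughly the kind of layout I
mean -- purely illustrative, with made-up names and constants rather
than the actual definitions from the patch:

#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"

/* Illustrative sketch only -- not the real definitions from dht-v1.patch. */
#define DHT_NUM_PARTITIONS 128		/* hypothetical partition count */

typedef struct dht_partition
{
	LWLock		lock;			/* protects buckets mapped to this partition */
	size_t		count;			/* item count, used to decide when to resize */
} dht_partition;

typedef struct dht_control
{
	dht_partition partitions[DHT_NUM_PARTITIONS];
	dsa_pointer buckets;		/* DSA array of dsa_pointer chain heads */
	size_t		size_log2;		/* number of buckets is 1 << size_log2 */
	dsa_pointer old_buckets;	/* valid while an incremental resize runs */
	size_t		old_size_log2;
} dht_control;

Because everything is addressed by dsa_pointer rather than raw
pointers, any backend that attaches to the DSA area can follow the
chains, and the per-partition locks give concurrent writers something
finer-grained than one big table lock.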

There is other work being done in this space:  I'm aware of Andres's
current thread[1] on Robin Hood-style closed hashing tables with
macro-ised size, hash and comparator operations à la C++ container
templates, and I'm following along with interest.  He vigorously
rejects chaining hash tables as the natural enemies of memory
hardware.  Obviously there are pros and cons: this node-based chaining
design allows resizing, iteration with stronger guarantees, and users
can point directly to (or into) the entries from other data
structures, but yeah... it has to follow pointers into far away cache
lines to do that.  I guess Andres's simplehash should already work in
DSA memory nicely anyway since it doesn't have any internal pointers
IIUC (?).  As for concurrency, Robert Haas also did some interesting
experiments[2] with (nearly) lock-free hash tables and hazard
pointers.  I'm not sure what the optimum number of different
implementations or levels of configurability is, and I'm certainly not
wedded to this particular set of design tradeoffs.

Potential use cases for DHT include caches, in-memory database objects
and working state for parallel execution.  (Not a use case: the shared
hash table for my as-yet-unposted DSA-based shared parallel hash join
table work, which follows the existing open-coded dense_alloc-based
approach for reasons I'll explain later.)

There are a couple of things I'm not happy about in this version of
DHT: the space overhead per item is too high (hash + wasted padding +
next pointer), and the iterator system is overly complicated.  I have
another version in development which gets rid of the hash and padding
and sometimes gets rid of the next pointer to achieve zero overhead,
and I'm working on a simpler iteration algorithm that keeps the
current guarantees but doesn't need to reorder bucket chains.  More on
that, with some notes on performance and a testing module, soon.
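
To be concrete about that overhead, each item currently carries a small
header along these lines (again an illustrative sketch, not the literal
struct from the patch):

/* Illustrative only: the shape of the current per-item overhead. */
typedef struct dht_hash_table_item
{
	uint32		hash;			/* stored hash of the entry's key */
	/* padding up to MAXALIGN here is the "wasted padding" above */
	dsa_pointer next;			/* next item in this bucket's chain */
	/* the caller's entry (key and value) follows at a MAXALIGN'd offset */
} dht_hash_table_item;

So every entry pays for a stored hash, some alignment padding and a
chain pointer before the caller's data even begins; that's the overhead
the next version tries to reduce, and in some cases eliminate.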

Please find attached dht-v1.patch, which depends on dsa-v2.patch[3].
I'm posting this version of DHT today because it is a dependency of
another patch due on this list shortly.

See also simple-cache.patch which contains a simple smoke test
extension so you can type SELECT simple_cache_put('42', 'Hello
world'), SELECT simple_cache_get('42') etc.

Thanks to my colleagues Amit Khandekar, Amul Sul, Dilip Kumar and John
Gorman for bug fixes and suggestions, and Robert Haas for feedback.

Please let me know your thoughts, and thanks for reading.

[1] https://www.postgresql.org/message-id/flat/20161003041215.tfrifbeext3i7nkj%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/CA%2BTgmoYE4t-Pt%2Bv08kMO5u_XN-HNKBWtfMgcUXEGBrQiVgdV9Q%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAEepm%3D1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Hash tables in dynamic shared memory

From
Andres Freund
Date:
Hi,

> It's impossible to write a general purpose hash table that will be
> suitable for every use case, considering all the combinations of
> design trade-offs and subsets of functionality someone might want.

Very much agreed.


> There is other work being done in this space:  I'm aware of Andres's
> current thread[1] on Robin Hood-style closed hashing tables with
> macro-ised size, hash and comparator operations à la C++ container
> templates, and I'm following along with interest.

> He vigorously
> rejects chaining hash tables as the natural enemies of memory
> hardware.

I'd not go quite that far. There are good arguments for open addressing,
and there's good arguments for open chaining. The latter is a lot more
convincing if you want concurrent access with partition based locking,
for example...


> I guess Andres's simplehash should already work in DSA memory nicely
> anyway since it doesn't have any internal pointers IIUC (?).

The data access doesn't have pointers, but the "metadata" does have a
pointer to the data; that's a very solvable problem.  But incremental
resizing and granular locking aren't going to happen for it, so it'd
really only be suitable for presenting a read-only (or possibly
read-mostly) view.


> Potential use cases for DHT include caches, in-memory database objects
> and working state for parallel execution.

Is there a more concrete example, i.e. a user we'd convert to this at
the same time as introducing this hashtable?


> There are a couple of things I'm not happy about in this version of
> DHT: the space overhead per item is too high (hash + wasted padding +
> next pointer), and the iterator system is overly complicated.  I have
> another version in development which gets rid of the hash and padding
> and sometimes gets rid of the next pointer to achieve zero overhead,
> and I'm working on a simpler iteration algorithm that keeps the
> current guarantees but doesn't need to reorder bucket chains.  More on
> that, with some notes on performance and a testing module, soon.

FWIW, my experimentation showed that getting rid of the hash itself is
quite often *NOT* a worthwhile tradeoff. Comparing keys and recomputing
hashes are often quite expensive (e.g. for GROUP BY it's where the
majority of time is spent after the simplehash patch).
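
To spell that out: the stored hash lets a bucket walk reject most
non-matching entries with a cheap integer comparison and only fall back
to the real key comparison on a hash match, along these lines (a sketch
to illustrate the point; the compare callback and entry accessor are
made up, not from either patch):

static dht_hash_table_item *
find_in_chain(dht_hash_table *hash_table, dsa_pointer head,
              const void *key, uint32 hash)
{
    while (DsaPointerIsValid(head))
    {
        dht_hash_table_item *item = dsa_get_address(hash_table->area, head);

        /* cheap rejection first; only compare keys when the hashes match */
        if (item->hash == hash &&
            hash_table->compare_function(item_key(item), key) == 0)
            return item;
        head = item->next;
    }
    return NULL;
}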

Greetings,

Andres Freund



Re: Hash tables in dynamic shared memory

From
Thomas Munro
Date:
On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund <andres@anarazel.de> wrote:
>> Potential use cases for DHT include caches, in-memory database objects
>> and working state for parallel execution.
>
> Is there a more concrete example, i.e. a user we'd convert to this at
> the same time as introducing this hashtable?

A colleague of mine will shortly post a concrete patch to teach an
existing executor node how to be parallel aware, using DHT.  I'll let
him explain.

I haven't looked into whether it would make sense to convert any
existing shmem dynahash hash table to use DHT.  The reason for doing
so would be to move it out to DSM segments and enable dynamically
growing.  I suspect that the bounded size of things like the hash
tables involved in (for example) predicate locking is considered a
feature, not a bug, so any such cluster-lifetime core-infrastructure
hash table would not be a candidate.  More likely candidates would be
ephemeral data used by the executor, as in the above-mentioned patch,
and long lived caches of dynamic size owned by core code or
extensions.  Like a shared query plan cache, if anyone can figure out
the invalidation magic required.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Hash tables in dynamic shared memory

From
Thomas Munro
Date:
On Wed, Oct 5, 2016 at 12:11 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund <andres@anarazel.de> wrote:
>>> Potential use cases for DHT include caches, in-memory database objects
>>> and working state for parallel execution.
>>
>> Is there a more concrete example, i.e. a user we'd convert to this at
>> the same time as introducing this hashtable?
>
> A colleague of mine will shortly post a concrete patch to teach an
> existing executor node how to be parallel aware, using DHT.  I'll let
> him explain.
>
> I haven't looked into whether it would make sense to convert any
> existing shmem dynahash hash table to use DHT.  The reason for doing
> so would be to move it out to DSM segments and enable dynamically
> growing.  I suspect that the bounded size of things like the hash
> tables involved in (for example) predicate locking is considered a
> feature, not a bug, so any such cluster-lifetime core-infrastructure
> hash table would not be a candidate.  More likely candidates would be
> ephemeral data used by the executor, as in the above-mentioned patch,
> and long lived caches of dynamic size owned by core code or
> extensions.  Like a shared query plan cache, if anyone can figure out
> the invalidation magic required.

Another thought: it could be used to make things like
pg_stat_statements not have to be in shared_preload_libraries.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Hash tables in dynamic shared memory

From
Magnus Hagander
Date:
<p dir="ltr"><p dir="ltr">On Oct 5, 2016 1:23 AM, "Thomas Munro" <<a
href="mailto:thomas.munro@enterprisedb.com">thomas.munro@enterprisedb.com</a>>wrote:<br /> ><br /> > On Wed,
Oct5, 2016 at 12:11 PM, Thomas Munro<br /> > <<a
href="mailto:thomas.munro@enterprisedb.com">thomas.munro@enterprisedb.com</a>>wrote:<br /> > > On Wed, Oct 5,
2016at 11:22 AM, Andres Freund <<a href="mailto:andres@anarazel.de">andres@anarazel.de</a>> wrote:<br /> >
>>>Potential use cases for DHT include caches, in-memory database objects<br /> > >>> and working
statefor parallel execution.<br /> > >><br /> > >> Is there a more concrete example, i.e. a user we'd
convertto this at<br /> > >> the same time as introducing this hashtable?<br /> > ><br /> > > A
colleagueof mine will shortly post a concrete patch to teach an<br /> > > existing executor node how to be
parallelaware, using DHT.  I'll let<br /> > > him explain.<br /> > ><br /> > > I haven't looked into
whetherit would make sense to convert any<br /> > > existing shmem dynahash hash table to use DHT.  The reason
fordoing<br /> > > so would be to move it out to DSM segments and enable dynamically<br /> > > growing.  I
suspectthat the bounded size of things like the hash<br /> > > tables involved in (for example) predicate locking
isconsidered a<br /> > > feature, not a bug, so any such cluster-lifetime core-infrastructure<br /> > >
hashtable would not be a candidate.  More likely candidates would be<br /> > > ephemeral data used by the
executor,as in the above-mentioned patch,<br /> > > and long lived caches of dynamic size owned by core code
or<br/> > > extensions.  Like a shared query plan cache, if anyone can figure out<br /> > > the
invalidationmagic required.<br /> ><br /> > Another thought: it could be used to make things like<br /> >
pg_stat_statementsnot have to be in shared_preload_libraries.<br /> ><p dir="ltr">That would indeed be a great
improvement.And possibly also allow the changing of the max number of statements it can track without a restart? <p
dir="ltr">Iwas also wondering if it might be useful for a replacement for some of the pgstats stuff to get rid of the
costof spooling to file and then rebuilding the hash tables in the receiving end. I've been waiting for this patch to
figureout if that's useful. I mean keep the stats collector doing what it does now over udp, but present the results in
sharedhash tables instead of files. <p dir="ltr">/Magnus <br /> 

Re: Hash tables in dynamic shared memory

From
Dilip Kumar
Date:
On Wed, Oct 5, 2016 at 3:10 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here is an experimental hash table implementation that uses DSA memory
> so that hash tables can be shared by multiple backends and yet grow
> dynamically.  Development name:  "DHT".

+dht_iterate_begin(dht_hash_table *hash_table,
+                  dht_iterator *iterator,
+                  bool exclusive)
+{
+    Assert(hash_table->control->magic == DHT_MAGIC);
+
+    iterator->hash_table = hash_table;
+    iterator->exclusive = exclusive;
+    iterator->partition = 0;
+    iterator->bucket = 0;
+    iterator->item = NULL;
+    iterator->locked = false;

While reviewing, I found that the dht_iterate_begin function does not
initialize iterator->last_item_pointer to InvalidDsaPointer, and this
variable is later used during the scan in the advance_iterator
function.  (I think this could cause problems, possibly even the loss
of some tuples?)

+advance_iterator(dht_iterator *iterator, dsa_pointer bucket_head,
+                 dsa_pointer *next)
+{
+    dht_hash_table_item *item;
+
+    while (DsaPointerIsValid(bucket_head))
+    {
+        item = dsa_get_address(iterator->hash_table->area,
+                               bucket_head);
+        if ((!DsaPointerIsValid(iterator->last_item_pointer) ||
+             bucket_head < iterator->last_item_pointer) &&

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Hash tables in dynamic shared memory

From
Thomas Munro
Date:
On Thu, Oct 6, 2016 at 12:02 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> While reviewing, I found that the dht_iterate_begin function does not
> initialize iterator->last_item_pointer to InvalidDsaPointer,

Fixed, thanks.

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Hash tables in dynamic shared memory

From
Thomas Munro
Date:
On Wed, Oct 5, 2016 at 7:02 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Oct 5, 2016 1:23 AM, "Thomas Munro" <thomas.munro@enterprisedb.com>
> wrote:
>>
>> On Wed, Oct 5, 2016 at 12:11 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>> > On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund <andres@anarazel.de>
>> > wrote:
>> >>> Potential use cases for DHT include caches, in-memory database objects
>> >>> and working state for parallel execution.
>> >>
>> >> Is there a more concrete example, i.e. a user we'd convert to this at
>> >> the same time as introducing this hashtable?
>> >
>> > A colleague of mine will shortly post a concrete patch to teach an
>> > existing executor node how to be parallel aware, using DHT.  I'll let
>> > him explain.
>> >
>> > I haven't looked into whether it would make sense to convert any
>> > existing shmem dynahash hash table to use DHT.  The reason for doing
>> > so would be to move it out to DSM segments and enable dynamically
>> > growing.  I suspect that the bounded size of things like the hash
>> > tables involved in (for example) predicate locking is considered a
>> > feature, not a bug, so any such cluster-lifetime core-infrastructure
>> > hash table would not be a candidate.  More likely candidates would be
>> > ephemeral data used by the executor, as in the above-mentioned patch,
>> > and long lived caches of dynamic size owned by core code or
>> > extensions.  Like a shared query plan cache, if anyone can figure out
>> > the invalidation magic required.
>>
>> Another thought: it could be used to make things like
>> pg_stat_statements not have to be in shared_preload_libraries.
>>
>
> That would indeed be a great improvement. And possibly also allow the
> changing of the max number of statements it can track without a restart?

Yeah.  You don't explicitly size a DHT hash table; it just grows as
required to keep the load factor low enough, possibly causing the DSA
area it lives in to create more DSM segments.  Currently it never gets
smaller though, so if you throw out a bunch of entries you will free
the memory occupied by the entries themselves (meaning it's given back
to the DSA area, which might eventually give it back to the OS if the
planets are correctly aligned and a DSM segment is left entirely
unused), but the hash table's bucket array won't ever shrink.

> I was also wondering if it might be useful for a replacement for some of the
> pgstats stuff to get rid of the cost of spooling to file and then rebuilding
> the hash tables in the receiving end. I've been waiting for this patch to
> figure out if that's useful. I mean keep the stats collector doing what it
> does now over udp, but present the results in shared hash tables instead of
> files.

Interesting thought.  I haven't studied how that stuff works... I've
put it on my reading list.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Hash tables in dynamic shared memory

From
Andres Freund
Date:
On 2016-10-05 08:02:42 +0200, Magnus Hagander wrote:
> On Oct 5, 2016 1:23 AM, "Thomas Munro" <thomas.munro@enterprisedb.com>
> wrote:
> > Another thought: it could be used to make things like
> > pg_stat_statements not have to be in shared_preload_libraries.

> That would indeed be a great improvement. And possibly also allow the
> changing of the max number of statements it can track without a restart?

There's the issue of having to serialize / load the file around a
restart...

Greetings,

Andres Freund



Re: Hash tables in dynamic shared memory

From
John Gorman
Date:
I reviewed the dht-v2.patch and found that it is in excellent shape.

Benchmarking shows that this performs somewhat faster than
dynahash which is surprising because it is doing DSA address
translations on the fly.

One area where this could run faster is to reduce the amount of
time when the hash is in the RESIZE_IN_PROGRESS state.

When a hash partition reaches 75% full a resize begins by allocating
an empty new hash bucket array with double the number of buckets.
This begins the RESIZE_IN_PROGRESS state where there is both an
old and a new hash bucket array.

During the RESIZE state all operations such as find, insert,
delete and iterate run slower due to having to look items up in
two hash bucket arrays.

Whenever a new item is inserted into the hash and the hash is in
the RESIZE state the current code takes the time to move one item
from the old hash partition to the new hash partition. During
insertion an exclusive lock is already held on the partition so
this is an efficient time to do the resize cleanup work.

However since we only clean up one item per insertion we do not
complete the cleanup and free the old hash bucket array until we
are due to start yet another resize operation.

So we are effectively always in the RESIZE state which slows down
operations and wastes some space.

Here are the timings for insert and find in nanoseconds on a
Macbook Pro. The insert_resize figure includes the resize work to
move one item from the old to the new hash bucket array.

insert_resize: 945.98
insert_clean:  450.39
find_resize:   348.90
find_clean:    293.16

The attached dht-v2-resize-cleanup.patch patch applies on top of
the dht-v2.patch and speeds up the resize cleanup process so that
hashes spend most of their time in the clean state.

It does this by cleaning up more than one old item during
inserts. This patch cleans up three old items.

There is also the case where a hash receives all of its inserts
at the beginning and the rest of the workload is all finds. This
patch also cleans up two items for each find.
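
In other words, each insert (and now each find) migrates a small fixed
number of items from the old bucket array before doing its own work,
roughly like this simplified sketch (the helper functions here are
stand-ins, not the literal code in the patch):

/* Simplified sketch of the incremental cleanup step; the helpers are
 * hypothetical stand-ins for what the patch actually does. */
static void
resize_cleanup(dht_hash_table *hash_table, int max_items)
{
    int         moved = 0;

    /*
     * The caller holds the partition lock exclusively: inserts already
     * do, and finds take it just for this, which is the extra locking
     * cost shown below.
     */
    while (moved < max_items && !resize_is_complete(hash_table))
    {
        /* migrate one item from the old bucket array into the new one */
        move_one_item_from_old_array(hash_table);
        moved++;
    }

    /* once the old array is empty, leave RESIZE_IN_PROGRESS and free it */
    if (resize_is_complete(hash_table))
        finish_resize(hash_table);
}

Inserts call this with max_items = DHT_INSERT_CLEANS and finds with
max_items = DHT_SEARCH_CLEANS (the constants shown further down), so
the hash catches up with the pending migration quickly and spends most
of its time back in the clean state.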

The downside of doing extra cleanup early is some additional
latency. Here are the experimental figures and the approximate
formulas for different numbers of cleanups for inserts. Note that
we cannot go lower than one cleanup per insert.

Resize work in inserts: 729.79 insert + 216.19 * cleanups
1  945.98
2 1178.13
3 1388.73
4 1617.04
5 1796.91

Here are the timings for different numbers of cleanups for finds.
Note that since we do not hold an exclusive partition lock on a find
there is the additional overhead of taking that lock.

Resize work in finds: 348.90 dirty_find + 233.45 lock + 275.78 * cleanups
0  348.90
1  872.88
2 1138.98
3 1448.94
4 1665.31
5 1975.91

The new effective latencies during the resize vs. the clean states.

#define DHT_INSERT_CLEANS 3
#define DHT_SEARCH_CLEANS 2

insert_resize: 1388.73  -- 3 cleaned
insert_clean:   450.39
find_resize:   1138.98  -- 2 cleaned
find_clean:     293.16

The overall performance will be faster due to not having to examine
more than one hash bucket array most of the time.

--

John Gorman
Attachment

Re: Hash tables in dynamic shared memory

From
Haribabu Kommi
Date:


On Sun, Nov 20, 2016 at 9:54 AM, John Gorman <johngorman2@gmail.com> wrote:
> I reviewed the dht-v2.patch and found that it is in excellent shape.
>
> The overall performance will be faster due to not having to examine
> more than one hash bucket array most of the time.


The reviewer didn't find any problems with the approach of the patch,
and offered some suggested improvements for the author to respond to.

Moved to next CF with "waiting on author" state.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Hash tables in dynamic shared memory

From
Thomas Munro
Date:
On Sun, Nov 20, 2016 at 11:54 AM, John Gorman <johngorman2@gmail.com> wrote:
> I reviewed the dht-v2.patch and found that it is in excellent shape.

Thanks for reviewing!  And sorry for the late reply.

> Benchmarking shows that this performs somewhat faster than
> dynahash which is surprising because it is doing DSA address
> translations on the fly.

That is indeed surprising and I think warrants more investigation.

> One area where this could run faster is to reduce the amount of
> time when the hash is in the RESIZE_IN_PROGRESS state.
>
> When a hash partition reaches 75% full a resize begins by allocating
> an empty new hash bucket array with double the number of buckets.
> This begins the RESIZE_IN_PROGRESS state where there is both an
> old and a new hash bucket array.
>
> During the RESIZE state all operations such as find, insert,
> delete and iterate run slower due to having to look items up in
> two hash bucket arrays.
>
> Whenever a new item is inserted into the hash and the hash is in
> the RESIZE state the current code takes the time to move one item
> from the old hash partition to the new hash partition. During
> insertion an exclusive lock is already held on the partition so
> this is an efficient time to do the resize cleanup work.
>
> However since we only clean up one item per insertion we do not
> complete the cleanup and free the old hash bucket array until we
> are due to start yet another resize operation.
>
> So we are effectively always in the RESIZE state which slows down
> operations and wastes some space.

Right, that is the case in a workload that inserts a bunch of data but
then becomes read-only forever.  A workload that constantly does a mix
of writing and reading should settle down at a reasonable size and
stop resizing.

> Here are the timings for insert and find in nanoseconds on a
> Macbook Pro. The insert_resize figure includes the resize work to
> move one item from the old to the new hash bucket array.
>
> insert_resize: 945.98
> insert_clean:  450.39
> find_resize:   348.90
> find_clean:    293.16
>
> The attached dht-v2-resize-cleanup.patch patch applies on top of
> the dht-v2.patch and speeds up the resize cleanup process so that
> hashes spend most of their time in the clean state.
>
> It does this by cleaning up more than one old item during
> inserts. This patch cleans up three old items.
>
> There is also the case where a hash receives all of its inserts
> at the beginning and the rest of the workload is all finds. This
> patch also cleans up two items for each find.
>
> The downside of doing extra cleanup early is some additional
> latency. Here are the experimental figures and the approximate
> formulas for different numbers of cleanups for inserts. Note that
> we cannot go lower than one cleanup per insert.
>
> Resize work in inserts: 729.79 insert + 216.19 * cleanups
> 1  945.98
> 2 1178.13
> 3 1388.73
> 4 1617.04
> 5 1796.91
>
> Here are the timings for different numbers of cleanups for finds.
> Note that since we do not hold an exclusive partition lock on a find
> there is the additional overhead of taking that lock.
>
> Resize work in finds: 348.90 dirty_find + 233.45 lock + 275.78 * cleanups
> 0  348.90
> 1  872.88
> 2 1138.98
> 3 1448.94
> 4 1665.31
> 5 1975.91
>
> The new effective latencies during the resize vs. the clean states.
>
> #define DHT_INSERT_CLEANS 3
> #define DHT_SEARCH_CLEANS 2
>
> insert_resize: 1388.73  -- 3 cleaned
> insert_clean:   450.39
> find_resize:   1138.98  -- 2 cleaned
> find_clean:     293.16
>
> The overall performance will be faster due to not having to examine
> more than one hash bucket array most of the time.

Thanks for doing all these experiments.  Yeah, I think it makes sense
to merge this change to improve that case.

Since Dilip Kumar's Parallel Bitmap Heap Scan project is no longer
using this, I think I should park it here unless/until another
potential use case pops up.  Some interesting candidates have been
mentioned already, and I'm fairly sure there are other uses too, but I
don't expect anyone to be interested in committing this patch until
something concrete shows up, so I'll go work on other things until
then.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Hash tables in dynamic shared memory

From
Thomas Munro
Date:
On Mon, Dec 19, 2016 at 12:33 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Since Dilip Kumar's Parallel Bitmap Heap Scan project is no longer
> using this, I think I should park it here unless/until another
> potential use case pops up.  Some interesting candidates have been
> mentioned already, and I'm fairly sure there are other uses too, but I
> don't expect anyone to be interested in committing this patch until
> something concrete shows up, so I'll go work on other things until
> then.

Here is a rebased version.  Changes since v2:

1.  Out-of-memory conditions now raise errors (following
dsa_allocate's change in behaviour to match palloc).
2.  Moved into 'lib' where other reusable data structures live.

There are still a couple of things I'd like to adjust in this code
(including feedback from John and improvements to the iterator code
which is overcomplicated) but I'm posting it today as infrastructure
for another patch.

-- 
Thomas Munro
http://www.enterprisedb.com


Attachment