Thread: allowing broader use of simplehash

allowing broader use of simplehash

From

Robert Haas

Date:

10 December 2019, 18:07:02

I recently became annoyed while working on patch A that I could not
use simplehash in shared memory, and then I became annoyed again while
working on patch B that I could not use simplehash in frontend code.
So here are a few patches for discussion.

A significant problem in either case is that a simplehash wants to
live in a memory context; no such thing exists either for data in
shared memory nor in frontend code. However, it seems to be quite easy
to provide a way for simplehash to be defined so that it doesn't care
about memory contexts. See 0001.

As far as frontend code goes, the only other problem I found is that
it makes use of elog() to signal some internal-ish messages. It seemed
to me that the easiest thing to do was, if FRONTEND is defined, use
pg_log_error(...) instead of elog(ERROR, ...). For the one instance of
elog(LOG, ...) in simplehash.h, I chose to use pg_log_info(). It's not
really equivalent, but it's probably the closest thing that currently
exists, and I think it's good enough for what's basically a debugging
message. See 0002.

I think those changes would also be enough to allow simplehash to be
used in a dynamic shared area (DSA). Using it in the main shared
memory segment seems more problematic, because simplehash relies on
being able to resize the hash table. Shared hash tables must have a
fixed maximum size, but with dynahash, we can count on being able to
use all of the entries without significant performance degradation.
simplehash, on the other hand, uses linear probing and relies on being
able to grow the hash table as a way of escaping collisions. By
default, the load factor is not permitted to drop below 0.1, so to
mimic the collision-avoidance behavior that we get in backend-private
uses of simplehash, we'd have to overallocate by 10x, which doesn't
seem desirable.

I'd really like to have an alternative to dynahash, which is awkward
to use and probably not particularly fast, but I'm not sure simplehash
is it.  Maybe what we really need is a third (or nth) hash table
implementation.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: allowing broader use of simplehash

From

Andres Freund

Date:

10 December 2019, 21:59:54

Hi,

Neat!

On 2019-12-10 13:07:02 -0500, Robert Haas wrote:
> I recently became annoyed while working on patch A that I could not
> use simplehash in shared memory, and then I became annoyed again while
> working on patch B that I could not use simplehash in frontend code.
> So here are a few patches for discussion.

I wanted to use it in frontend code a couple times as well.

> A significant problem in either case is that a simplehash wants to
> live in a memory context; no such thing exists either for data in
> shared memory nor in frontend code. However, it seems to be quite easy
> to provide a way for simplehash to be defined so that it doesn't care
> about memory contexts. See 0001.

I wonder if we shouldn't instead just go for an "implicit" memory
context instead. It's a bit ugly to have a growing set of different
signatures.

> As far as frontend code goes, the only other problem I found is that
> it makes use of elog() to signal some internal-ish messages. It seemed
> to me that the easiest thing to do was, if FRONTEND is defined, use
> pg_log_error(...) instead of elog(ERROR, ...). For the one instance of
> elog(LOG, ...) in simplehash.h, I chose to use pg_log_info(). It's not
> really equivalent, but it's probably the closest thing that currently
> exists, and I think it's good enough for what's basically a debugging
> message. See 0002.

Yea, I think that's fine.

> I think those changes would also be enough to allow simplehash to be
> used in a dynamic shared area (DSA). Using it in the main shared
> memory segment seems more problematic, because simplehash relies on
> being able to resize the hash table. Shared hash tables must have a
> fixed maximum size, but with dynahash, we can count on being able to
> use all of the entries without significant performance degradation.
> simplehash, on the other hand, uses linear probing and relies on being
> able to grow the hash table as a way of escaping collisions. By
> default, the load factor is not permitted to drop below 0.1, so to
> mimic the collision-avoidance behavior that we get in backend-private
> uses of simplehash, we'd have to overallocate by 10x, which doesn't
> seem desirable.

It'd be fine to set SH_GROW_MIN_FILLFACTOR to something higher, for many
uses. I've only added that after the fact, because somebody demonstrated
a workload with SQL level data that had a *lot* of conflicts with our
hash functions. But that shouldn't be a concern for most other uses.

> I'd really like to have an alternative to dynahash, which is awkward
> to use and probably not particularly fast, but I'm not sure simplehash
> is it.  Maybe what we really need is a third (or nth) hash table
> implementation.

I think it depends a bit on what use-cases you'd like to cover? I think
there's unfortunately a lot of tradeoffs here that are hard to square:

1) For performance, we want the hashtable code to be specialized for the
   specific key/value combination. I'm not aware of a way to do that
   without some code generation thing like simplehash. Being able to use
   simpler pointer math by having fixed sizes, and avoiding indirect
   function calls, is important for performance.

   It's fairly annoying to have to do the song-and-dance for simplehash
   when it's just a local lookup table or something.

2) For performance, using a chained hashtable turns out to be
   problematic, if the hashtable will often get big (for small tables
   the CPU caches makes it ok).  It's hard to avoid reallocations
   (and/or smoother growth) for non-chaining tables however.

3) For lots of one-off uses of hashtables that aren't performance
   critical, we want a *simple* API. That IMO would mean that key/value
   end up being separately allocated pointers, and that just a
   comparator is provided when creating the hashtable.

4) For some hashtables it's important to be very concurrent - but it's
   considerably harder to do that with an open addressing one.

While I don't think it's possible to avoid compromise on all these
aspects, I think it'd be a lot more realistic to have one implementation
fulfilling most needs (except perhaps the concurrency part) if we didn't
have the limitations of C.  This kind of thing really is one where
e.g. C++ style templates are just extremely hard to beat in C.

Greetings,

Andres Freund

Re: allowing broader use of simplehash

From

Robert Haas

Date:

11 December 2019, 15:05:00

On Tue, Dec 10, 2019 at 4:59 PM Andres Freund <andres@anarazel.de> wrote:
> Neat!

Thanks.

> > A significant problem in either case is that a simplehash wants to
> > live in a memory context; no such thing exists either for data in
> > shared memory nor in frontend code. However, it seems to be quite easy
> > to provide a way for simplehash to be defined so that it doesn't care
> > about memory contexts. See 0001.
>
> I wonder if we shouldn't instead just go for an "implicit" memory
> context instead. It's a bit ugly to have a growing set of different
> signatures.

I don't really know what you mean by this. I don't actually think the
different signatures are a big deal. It affects a pretty limited
number of functions, and that seems less ugly than trying to create
some sort of dummy not-really-a-context object that can live in
frontend code, and a lot easier than actually making contexts work in
frontend code. The latter might be the better long-term solution, but
I don't think we should insist on doing it first.

Another way forward would be to replace the MemoryContext references
with a void * that happens, in the case of the backend, to be a
MemoryContext, and could be NULL when none is required. However, that
would give up some type-checking for no current benefit. If simplehash
becomes more widely used and at some point it's clear that this would
be a net win, we can change it then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: allowing broader use of simplehash

From

Robert Haas

Date:

11 December 2019, 15:50:16

On Tue, Dec 10, 2019 at 4:59 PM Andres Freund <andres@anarazel.de> wrote:
> 3) For lots of one-off uses of hashtables that aren't performance
>    critical, we want a *simple* API. That IMO would mean that key/value
>    end up being separately allocated pointers, and that just a
>    comparator is provided when creating the hashtable.

I think the simplicity of the API is a key point. Some things that are
bothersome about dynahash:

- It knows about memory contexts and insists on having its own.
- You can't just use a hash table in shared memory; you have to
"attach" to it first and have an object in backend-private memory.
- The usual way of getting a shared hash table is ShmemInitHash(), but
that means that the hash table has its own named chunk and that it's
in the main shared memory segment. If you want to put it inside
another chunk or put it in DSM or whatever, it doesn't work.
- It knows about LWLocks and if it's a shared table it needs its own
tranche of them.
- hash_search() is hard to wrap your head around.

One thing I dislike about simplehash is that the #define-based
interface is somewhat hard to use. It's not that it's a bad design.
It's just you have to sit down and think for a while to figure out
which things you need to #define in order to get it to do what you
want. I'm not sure that's something that can or needs to be fixed, but
it's something to consider. Even dynahash, as annoying as it is, is in
some ways easier to get up and running.

Probably the two most common uses cases are: (1) a fixed-sized shared
memory hash table of fixed-size entries where the key is the first N
bytes of the entry and it never grows, or (2) a backend-private or
perhaps frontend hash table of fixed-size entries where the key is the
first N bytes of the entry, and it grows without limit. I think should
consider having specialized APIs for those two cases and then more
general APIs that you can use when that's not enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: allowing broader use of simplehash

From

Andres Freund

Date:

12 December 2019, 19:33:26

Hi,

On 2019-12-11 10:05:00 -0500, Robert Haas wrote:
> On Tue, Dec 10, 2019 at 4:59 PM Andres Freund <andres@anarazel.de> wrote:
> > > A significant problem in either case is that a simplehash wants to
> > > live in a memory context; no such thing exists either for data in
> > > shared memory nor in frontend code. However, it seems to be quite easy
> > > to provide a way for simplehash to be defined so that it doesn't care
> > > about memory contexts. See 0001.
> >
> > I wonder if we shouldn't instead just go for an "implicit" memory
> > context instead. It's a bit ugly to have a growing set of different
> > signatures.
> 
> I don't really know what you mean by this. I don't actually think the
> different signatures are a big deal. It affects a pretty limited
> number of functions, and that seems less ugly than trying to create
> some sort of dummy not-really-a-context object that can live in
> frontend code, and a lot easier than actually making contexts work in
> frontend code. The latter might be the better long-term solution, but
> I don't think we should insist on doing it first.

I was basically just thinking that we could pass the context to use via
CurrentMemoryContext, instead of explicitly passing it in.


> Another way forward would be to replace the MemoryContext references
> with a void * that happens, in the case of the backend, to be a
> MemoryContext, and could be NULL when none is required. However, that
> would give up some type-checking for no current benefit. If simplehash
> becomes more widely used and at some point it's clear that this would
> be a net win, we can change it then.

Yea, that seems worse. I'd rather work on the MemoryContext
infrastructure being available for frontend code.

Greetings,

Andres Freund

Re: allowing broader use of simplehash

From

Andres Freund

Date:

12 December 2019, 19:51:40

Hi,

On 2019-12-11 10:50:16 -0500, Robert Haas wrote:
> On Tue, Dec 10, 2019 at 4:59 PM Andres Freund <andres@anarazel.de> wrote:
> > 3) For lots of one-off uses of hashtables that aren't performance
> >    critical, we want a *simple* API. That IMO would mean that key/value
> >    end up being separately allocated pointers, and that just a
> >    comparator is provided when creating the hashtable.
> 
> I think the simplicity of the API is a key point. Some things that are
> bothersome about dynahash:
> 
> - It knows about memory contexts and insists on having its own.

Which is a waste, in a good number of cases.

> - You can't just use a hash table in shared memory; you have to
> "attach" to it first and have an object in backend-private memory.

I'm not quite sure there's all that good an alternative to this,
tbh. For efficiency it's useful to have backend-local state, I
think. And I don't really see how to have that without needing to attach.

> - The usual way of getting a shared hash table is ShmemInitHash(), but
> that means that the hash table has its own named chunk and that it's
> in the main shared memory segment. If you want to put it inside
> another chunk or put it in DSM or whatever, it doesn't work.

I don't think it's quite realistic for the same implementation - although
the code could partially be shared and just specialized for both cases -
to be used for DSM and "normal" shared memory. That's however not an
excuse to have drastically different interfaces for both.

> - It knows about LWLocks and if it's a shared table it needs its own
> tranche of them.
> - hash_search() is hard to wrap your head around.
>

> One thing I dislike about simplehash is that the #define-based
> interface is somewhat hard to use. It's not that it's a bad design.

I agree. It's the best I could come up taking the limitations of C into
account, when focusing on speed and type safety.  I really think this
type of hack is a stopgap measure, and we ought to upgrade to a subset
of C++.

> It's just you have to sit down and think for a while to figure out
> which things you need to #define in order to get it to do what you
> want. I'm not sure that's something that can or needs to be fixed, but
> it's something to consider. Even dynahash, as annoying as it is, is in
> some ways easier to get up and running.

I have been wondering about providing one simplehash wrapper in a
central place that uses simplehash to store a {key*, value*}, and has a
creation interface that just accepts a comparator. Plus a few wrapper
creation functions for specific types (e.g. string, oid, int64).  While
we'd not want to use that for really performance critical paths, for 80%
of the cases it'd be sufficient.

Greetings,

Andres Freund

Re: allowing broader use of simplehash

From

Robert Haas

Date:

15 December 2019, 03:24:35

On Thu, Dec 12, 2019 at 2:33 PM Andres Freund <andres@anarazel.de> wrote:
> I was basically just thinking that we could pass the context to use via
> CurrentMemoryContext, instead of explicitly passing it in.

I thought about that, but as a general rule, replacing a function
parameter with a global variable is the wrong direction. One could
argue this particular case is a counterexample, and I won't fight
tooth and nail if you want to take that position, but I don't think I
believe it myself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: allowing broader use of simplehash

From

Robert Haas

Date:

17 December 2019, 19:17:50

On Sat, Dec 14, 2019 at 10:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 12, 2019 at 2:33 PM Andres Freund <andres@anarazel.de> wrote:
> > I was basically just thinking that we could pass the context to use via
> > CurrentMemoryContext, instead of explicitly passing it in.
>
> I thought about that, but as a general rule, replacing a function
> parameter with a global variable is the wrong direction. One could
> argue this particular case is a counterexample, and I won't fight
> tooth and nail if you want to take that position, but I don't think I
> believe it myself.

After confirming with Andres that he didn't have an objection to me
pressing forward with these patches, I have committed them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company