Re: Use simplehash.h instead of dynahash in SMgr - Mailing list pgsql-hackers

From David Rowley
Subject Re: Use simplehash.h instead of dynahash in SMgr
Date
Msg-id CAApHDvopZhBHjKv=fUX-qdDQxaMe023YivZjNe4xxvJ741=rLw@mail.gmail.com
Whole thread Raw
In response to Re: Use simplehash.h instead of dynahash in SMgr  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Use simplehash.h instead of dynahash in SMgr
List pgsql-hackers
On Wed, 23 Jun 2021 at 12:17, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> David and I discussed this a bit off-list, and I just wanted to share
> how I understand the idea so far in case it helps someone else.  There
> are essentially three subcomponents working together:

Thanks for taking an interest in this. I started looking at your idea
and I've now changed my mind from just not liking it to thinking that
the whole idea is just completely horrible :-(

It gets really messy with all the nested pre-processor stuff around
fetching the element from the segmented array inside simplehash.  One
problem is that simplehash needs the address of the segments despite
simplehash not knowing anything about segments. I've tried to make
that work by passing in the generic hash struct as simplehash's
private_data. This ends up with deeply nested macros all defined in
different files. I pitty the future person debugging that.

There is also a problem of how to reference simplehash functions
inside the generichash code.  It's not possible to do things like
SH_CREATE() because all those macros are undefined at the end of
simplehash.h. It's no good to hardcode the names either as GH_PREFIX
must be used, else it wouldn't be possible to use more than 1
differenrly defined hash table per .c file.  Fixing this means either
modifying simplehash.h to not undefine all the name macros at the end
maybe with SH_NOUNDEF or creating another set of macros to build the
names for the simplehash functions inside the generic hash code. I
don't like either of those ideas.

There are also a bunch of changes / API breakages that need to be done
to make this work with simplehash.h.

1) Since I really need 8-byte buckets in the hash table to make this
as fast as possible, I want to use the array index for the hash status
and that means changing the simplehash API to allow that to work.
This requires something like SH_IS_BUCKET_INUSE, SH_SET_BUCKET_INUSE,
SH_SET_BUCKET_EMPTY.
2) I need to add a new memory allocation function to not zero the
memory.  At the moment all hash buckets are emptied when creating a
table by zeroing the bucket memory. If someone defines
SH_SET_BUCKET_EMPTY to do something that says 0 memory is not empty,
then that won't work. So I need to allocate the bucket memory then
call SH_SET_BUCKET_EMPTY on each bucket.
3) I'll need to replace SH_KEY with something more complex.  Since the
simplehash bucket will just have a uint32 hashvalue and uint32 index,
the hash key is not stored in the bucket, it's stored over in the
segment. I'll need to replace SH_KEY with SH_GETKEY and SH_SETKEY.
These will need to consult the simplehash's private_data so that the
element can be found in the segmented array.

Also, simplehash internally manages when the hash table needs to grow.
I'll need to perform separate checks to see if the segmented array
also must grow.  It's a bit annoying to double up those checks as
they're in a very hot path as they're done everytime someone inserts
into the table.

> 2.  A bitmapset that tracks unused elements in 1, making it easy to
> find the lowest-index hole when looking for a place to put a new one
> by linear search for a 1 bit, so that we tend towards maximum density
> despite having random frees from time to time (seems good, the same
> idea is used in  kernels to allocate the lowest unused file descriptor
> number).

I didn't use Bitmapsets. I wanted the bitmaps to be allocated in the
same chunk of memory as the segments of the array.  Also, because
bitmapset's nwords is variable, then they can't really do any loop
unrolling.  Since in my implementation the number of bitmap words are
known at compile-time, the compiler has the flexibility to do loop
unrolling.  The bitmap manipulation is one of the biggest overheads in
generichash.h. I'd prefer to keep that as fast as possible.

> 3. A hash table that has as elements indexes into 1.  It somehow hides
> the difference between keys (what callers look things up with) and
> keys reachable by following an index into 1 (where elements' keys
> live).

I think that can be done, but it would require looking up the
segmented array twice instead of once.  The first time would be when
we compare the keys after seeing the hash values match. The final time
would be in the calling code to translate the index to the pointer.
Hopefully the compiler would be able to optimize that to a single
lookup.

> One thought is that you could do 1 as a separate component as the
> "primary" data structure, and use a plain old simplehash for 3 as a
> kind of index into it, but use pointers (rather than indexes) to
> objects in 1 as elements.  I don't know if it's better.

Using pointers would double the bucket width on a 64 bit machine. I
don't want to do that. Also, to be able to determine the segment from
the pointer it would require looping over each segment to check if the
pointer belongs there. With the index we can determine the segment
directly with bit-shifting the index.

So, with all that. I really don't think it's a great idea to try and
have this use simplehash.h code. I plan to pursue the idea I proposed
with having seperate hash table code that is coded properly to have
stable pointers into the data rather than trying to contort
simplehash's code into working that way.

David



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Skipping logical replication transactions on subscriber side
Next
From: Heikki Linnakangas
Date:
Subject: Re: speed up verifying UTF-8