Re: RFC: Packing the buffer lookup table - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: RFC: Packing the buffer lookup table
Date
Msg-id CAEze2WgUFgrH7m6nbC5FTO8=tQJkcb5se1Gm+RX7gZQaRYAcHg@mail.gmail.com
In response to Re: RFC: Packing the buffer lookup table  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, 5 Feb 2025 at 02:22, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-02-04 19:58:36 +0100, Matthias van de Meent wrote:
> > On Thu, 30 Jan 2025 at 08:48, Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > >
> > > Together that results in the following prototype patchset.
> >
> > Here's an alternative patch, which replaces dynahash in the buffer
> > lookup table with an open-coded replacement that has fewer
> > indirections during lookups, and allows for a much increased locality.
> >
> > Overall, it has less memory usage than our current dynahash in all
> > cases; but compared to my previous patch it improves memory only in
> > some cases, in others it uses more memory.
> >
> > The main design is a separate chaining hash table (similar to and
> > derived from code of Dynahash) to provide efficient access to free
> > elements.
>
> Why do you want to use a separate chaining hash table? For something as
> performance sensitive as this the lower cache hit ratio seems like a strong
> reason to not go for such a hash table "architecture"?

I think it's hard to defend this patchset against a perfect solution,
but lacking that perfect solution (and given that the current solution
is definitely not great), I'll share my thoughts on why I chose a
different hashmap. While wordy, I'm not trying to defend or attack
anything or anyone, just explaining my reasoning.

My main reasons are:

1.) This is a mostly drop-in replacement that, in the general case, is
better than dynahash in every metric; that is probably preferable to
continuing the status quo, given that dynahash is now really starting
to show its limits (see e.g. the hash_search_with_hash_value thread).

2.) I'm not experienced enough with non-constant-sized allocations in
shared memory to get something like a radixtree.h implemented there.
Given that threading is still some releases away, I thought a chaining
hash table (which needs only constant-size allocations and thus
requires only a simple slab allocator) would be OK if it still
improved on the status quo.
Note that radixtree's implementation has nodes ranging from 16 bytes
up to 1000+ bytes. Building an efficient allocator for that is
difficult (see: the bin-packing problem), assuming it needs to
interface with shmem and thus be statically pre-allocated.
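To illustrate why constant-size allocations make this easy: when every
element is the same size, the allocator can be a plain freelist threaded
through a statically sized array. This is only a minimal sketch under
that assumption - all names here are hypothetical, and the actual
patch's layout will differ:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical element type standing in for a buffer-table entry. */
typedef struct Element
{
    uint32_t    next_free;      /* index of next free slot, or INVALID */
    uint32_t    payload;
} Element;

#define NSLOTS  8
#define INVALID UINT32_MAX

/* In PostgreSQL these would live in statically sized shared memory. */
static Element  pool[NSLOTS];
static uint32_t freelist_head;

static void
slab_init(void)
{
    for (uint32_t i = 0; i < NSLOTS; i++)
        pool[i].next_free = (i + 1 < NSLOTS) ? i + 1 : INVALID;
    freelist_head = 0;
}

static uint32_t
slab_alloc(void)
{
    uint32_t    slot = freelist_head;

    if (slot != INVALID)
        freelist_head = pool[slot].next_free;
    return slot;                /* INVALID means the pool is exhausted */
}

static void
slab_free(uint32_t slot)
{
    pool[slot].next_free = freelist_head;
    freelist_head = slot;
}
```

No bin-packing, no fragmentation, O(1) alloc and free - which is the
property a fixed-size chaining table buys you that radixtree's mixed
node sizes don't.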

3.) simplehash.h can't currently be used because it requires resizing
at certain intervals, and we can't just dynamically re-allocate the
table. That makes the implementation unusable here.

4.) simplehash also invalidates cache lines containing otherwise
unmodified elements due to its relocation strategy, which isn't great
for cache sharing either.

5.) A big reason why dynahash is slow, apart from random,
unpredictable memory accesses, is that it needs to follow 3 pointers
before it reaches its first element: the hash table has a pointer to a
segment, which points to a bucket, which points to an ELEMENT - and
the hash table itself is an allocated struct, so it must be loaded
through a pointer as well. That isn't the only reason (e.g. bucket
chains are expensive, too), but it is probably a large one.
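To make that indirection chain concrete, here is a toy sketch of a
dynahash-style layout. The structures are illustrative only (not the
real PostgreSQL definitions); the point is the three dependent loads
before the first key comparison:

```c
#include <stdint.h>

typedef struct Elem
{
    uint32_t    key;
    struct Elem *next;
} Elem;

typedef Elem  *Bucket;          /* a bucket heads a chain of elements */
typedef Bucket *Segment;        /* a segment is an array of buckets */

typedef struct HTab
{
    Segment    *dir;            /* directory of segments */
    uint32_t    segsize;        /* buckets per segment */
} HTab;

/*
 * htab -> dir[seg] -> bucket -> element: three dependent loads before
 * the first key comparison (four counting htab itself, since it too
 * sits behind a pointer).
 */
static Elem *
dynahash_style_lookup(HTab *htab, uint32_t hash, uint32_t nbuckets)
{
    uint32_t    b = hash % nbuckets;
    Segment     seg = htab->dir[b / htab->segsize]; /* load 1 */
    Elem       *e = seg[b % htab->segsize];         /* load 2 */

    while (e && e->key != hash)                     /* load 3+ per elem */
        e = e->next;
    return e;
}
```

Each of those loads depends on the previous one, so the CPU can't
overlap them; every level that misses cache adds a full memory stall.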

As for what the buffer lookup table cares about, I considered the following:

cache locality: how quickly can you find an element, starting with 0 info?
    dynahash (--)   newhash (-)    simplehash (+)    radixtree (~?)
cache line dirtying: how many cache lines are dirtied by insert/delete
operations?
    dynahash (+++)  newhash (++)   simplehash (--)   radixtree (++ ?)
spatial locality: how close are "near" elements to each other in the
structure?
    dynahash (---)  newhash (--?)  simplehash (-?)   radixtree (++)
partitioning: can the structure be written to by multiple backends
concurrently?
    dynahash (+++)  newhash (+++)  simplehash (---)  radixtree (--?)
memory usage: compare memory overhead allocated for random buffers
    dynahash (~)    newhash (+)    simplehash (++)   radixtree (-?)


> I think it'd be better to work towards not using a hash table for the buffer
> mapping table than to write a dedicated hash table implementation for it
> though.  A hash table
>
> a) doesn't have ordered access support, which causes a bunch of O(NBuffers)
>    operations and makes things like readahead etc way more expensive
> b) doesn't benefit from spatial locality, even though buffer lookups have a
>    very high degree of spatial locality

While I hear those arguments, I'm not sure it's as easy as you make it
seem. You yourself have said that "dynahash is [a] really poor hash
table implementation." This new hash table was 'only' a few days'
work, and seems to have fixed at least 2 of the major issues we see in
dynahash: it reduces the pointer chasing before the first result to 2
in the optimal case, and (with some further changes which I have yet
to share) it may achieve much higher average locality than the
current, unmodified dynahash.
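For comparison with the dynahash chain above: if both the bucket heads
and the elements live in flat, statically placed arrays (which
constant-size allocation makes possible), the optimal-case lookup needs
only two dependent loads. A minimal sketch under that assumption -
names hypothetical, not the patch's actual code:

```c
#include <stdint.h>

#define NBUCKETS 8
#define NELEMS   8
#define INVALID  UINT32_MAX

typedef struct Entry
{
    uint32_t    key;
    uint32_t    next;           /* chain link as an index, not a pointer */
} Entry;

/*
 * Both arrays sit at fixed offsets in (shared) memory, so their base
 * addresses are compile-time/startup constants: no pointer load is
 * needed to find them.
 */
static uint32_t buckets[NBUCKETS];
static Entry    entries[NELEMS];

static uint32_t
lookup(uint32_t key)
{
    uint32_t    i = buckets[key % NBUCKETS];    /* load 1: bucket head */

    while (i != INVALID && entries[i].key != key)   /* load 2: entry */
        i = entries[i].next;
    return i;                   /* element index, or INVALID */
}
```

When the wanted entry is first in its chain, that is two dependent
loads total, versus dynahash's three-plus.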
I did look at other solutions (specifically, adapting radixtree.h
and/or simplehash.h) but, as I've mentioned above, the memory tooling
required for those solutions was a major turnoff, as it significantly
increases the complexity of such a project.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


