Re: RFC: Packing the buffer lookup table - Mailing list pgsql-hackers
From: Matthias van de Meent
Subject: Re: RFC: Packing the buffer lookup table
Msg-id: CAEze2WgUFgrH7m6nbC5FTO8=tQJkcb5se1Gm+RX7gZQaRYAcHg@mail.gmail.com
In response to: Re: RFC: Packing the buffer lookup table (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Wed, 5 Feb 2025 at 02:22, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-02-04 19:58:36 +0100, Matthias van de Meent wrote:
> > On Thu, 30 Jan 2025 at 08:48, Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > >
> > > Together that results in the following prototype patchset.
> >
> > Here's an alternative patch, which replaces dynahash in the buffer
> > lookup table with an open-coded replacement that has fewer
> > indirections during lookups, and allows for a much increased locality.
> >
> > Overall, it has less memory usage than our current dynahash in all
> > cases; but compared to my previous patch it improves memory only in
> > some cases, in others it uses more memory.
> >
> > The main design is a separate chaining hash table (similar to and
> > derived from code of Dynahash) to provide efficient access to free
> > elements.
>
> Why do you want to use a separate chaining hash table? For something as
> performance sensitive as this the lower cache hit ratio seems like a strong
> reason to not go for such a hash table "architecture"?

It's hard to defend this patchset against a perfect solution, but lacking that perfect solution (and given that the current solution is definitely not great), I'll share my thoughts on why I chose a different hashmap design. While wordy, I'm not trying to defend or attack anything or anyone, just explaining my reasoning. My main reasons are:

1.) This is a mostly drop-in replacement that, in the general case, is better than dynahash on every metric, which is probably preferable to continuing the status quo given that dynahash is now really starting to show its limits (see e.g. the hash_search_with_hash_value thread).

2.) I'm not experienced enough with writing non-constant-sized allocations in shared memory to get something like a radixtree.h implemented in shared memory.
Given that threading is still some releases away, I thought a chaining hash table (which only needs constant-size allocations and thus requires only a simple slab allocator) would be OK if it still improved on the status quo. Note that radixtree's implementation has node sizes ranging from 16 bytes up to 1000+ bytes; building an efficient allocator for that is difficult (see: the bin-packing problem), assuming it needs to interface with shmem and is thus statically pre-allocated.

3.) simplehash.h can't currently be used because it requires resizing at certain intervals, and we can't dynamically re-allocate the table. That makes the implementation unusable here.

4.) simplehash also invalidates cache lines containing otherwise unmodified elements due to its relocation strategy, which isn't great for cache sharing either.

5.) A big reason why dynahash is slow, apart from random unpredictable memory accesses, is said to be that it needs to follow 3 pointers before it has its first element: the hash table has a pointer to a segment, which holds a pointer to a bucket, which holds a pointer to an ELEMENT; and the hash table itself is an allocated struct, so it too must be loaded through a pointer. It isn't the only reason (e.g. bucket chains are expensive, too), but probably one large reason why it's so slow.

As for what the buffer lookup table cares about, I considered the following:

cache locality (how quickly you can find an element, starting with no information):
  dynahash (--), newhash (-), simplehash (+), radixtree (~?)
cache line dirtying (how many cache lines are dirtied by insert/delete operations):
  dynahash (+++), newhash (++), simplehash (--), radixtree (++ ?)
spatial locality (how close "near" elements are to each other in the structure):
  dynahash (---), newhash (--?), simplehash (-?), radixtree (++)
partitioning (can the structure be written to by multiple backends concurrently?):
  dynahash (+++), newhash (+++), simplehash (---), radixtree (--?)
memory usage (memory overhead allocated for random buffers):
  dynahash (~), newhash (+), simplehash (++), radixtree (-?)

> I think it'd be better to work towards not using a hash table for the buffer
> mapping table than to write a dedicated hash table implementation for it
> though. A hash table
>
> a) doesn't have ordered access support, which causes a bunch of O(NBuffers)
> operations and makes things like readahead etc way more expensive
> b) doesn't benefit from spatial locality, even though buffer lookups have a
> very high degree of spatial locality

While I hear those arguments, I'm not sure it's as easy as you make it seem. You yourself have said that "dynahash is [a] really poor hash table implementation." This new hash table was 'only' a few days' work, and it already seems to fix at least 2 of the major issues we see in dynahash: it reduces the pointer chasing to 2 pointers before the first result in the optimal case, and (with some further changes which I have yet to share) it may achieve a much higher average locality than the current, unmodified dynahash.

I did look at other solutions (specifically, adapting radixtree.h and/or simplehash.h), but as mentioned above, the memory tooling required for those solutions was a major turnoff, as it significantly increases the complexity of such a project.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)