Hash tables in dynamic shared memory - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Hash tables in dynamic shared memory |
Date | |
Msg-id | CAEepm=3d8o8XdVwYT6O=bHKsKAM2pu2D6sV1S_=4d+jStVCE7w@mail.gmail.com Whole thread Raw |
Responses |
Re: Hash tables in dynamic shared memory
(Andres Freund <andres@anarazel.de>)
Re: Hash tables in dynamic shared memory (Dilip Kumar <dilipbalaut@gmail.com>) |
List | pgsql-hackers |
Hi hackers, Here is an experimental hash table implementation that uses DSA memory so that hash tables can be shared by multiple backends and yet grow dynamically. Development name: "DHT". It's impossible to write a general purpose hash table that will be suitable for every use case, considering all the combinations of design trade offs and subsets of functionality someone might want. What I'm trying to do here is produce something that is generally useful for many cases and easy to use, along the lines of dynahash, but use dynamic shared memory. Here is the bullet point summary: * DSA memory-backed * open hashing (separate chaining) * incremental resizing * built-in partition-based locking * concurrent iteration There is other work being done in this space: I'm aware of Andres's current thread[1] on Robin Hood-style closed hashing tables with macro-ised size, hash and comparator operations à la C++ container templates, and I'm following along with interest. He vigorously rejects chaining hash tables as the natural enemies of memory hardware. Obviously there are pros and cons: this node-based chaining design allows resizing, iteration with stronger guarantees, and users can point directly to (or into) the entries from other data structures, but yeah... it has to follow pointers into far away cache lines to do that. I guess Andres's simplehash should already work in DSA memory nicely anyway since it doesn't have any internal pointers IIUC (?). As for concurrency, Robert Haas also did some interesting experiments[2] with (nearly) lock-free hash tables and hazard pointers. I'm not sure what the optimum number of different implementations or levels of configurability is, and I'm certainly not wedded to this particular set of design tradeoffs. Potential use cases for DHT include caches, in-memory database objects and working state for parallel execution. (Not a use case: the shared hash table for my as-yet-unposted DSA-based shared parallel hash join table work, which follows the existing open-coded dense_alloc-based approach for reasons I'll explain later.) There are a couple of things I'm not happy about in this version of DHT: the space overhead per item is too high (hash + wasted padding + next pointer), and the iterator system is overly complicated. I have another version in development which gets rid of the hash and padding and sometimes gets rid of the next pointer to achieve zero overhead, and I'm working on a simpler iteration algorithm that keeps the current guarantees but doesn't need to reorder bucket chains. More on that, with some notes on performance and a testing module, soon. Please find attached dht-v1.patch, which depends on dsa-v2.patch[3]. I'm posting this version of DHT today because it is a dependency of another patch due on this list shortly. See also simple-cache.patch which contains a simple smoke test extension so you can type SELECT simple_cache_put('42', 'Hello world'), SELECT simple_cache_get('42') etc. Thanks to my colleagues Amit Khandekar, Amul Sul, Dilip Kumar and John Gorman for bug fixes and suggestions, and Robert Haas for feedback. Please let me know your thoughts, and thanks for reading. [1] https://www.postgresql.org/message-id/flat/20161003041215.tfrifbeext3i7nkj%40alap3.anarazel.de [2] https://www.postgresql.org/message-id/CA%2BTgmoYE4t-Pt%2Bv08kMO5u_XN-HNKBWtfMgcUXEGBrQiVgdV9Q%40mail.gmail.com [3] https://www.postgresql.org/message-id/flat/CAEepm%3D1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
Attachment
pgsql-hackers by date: