Re: RFC: Packing the buffer lookup table - Mailing list pgsql-hackers
From | Matthias van de Meent
---|---
Subject | Re: RFC: Packing the buffer lookup table
Date |
Msg-id | CAEze2WiNEjjx7PWz8WJwkB=BV9RQ+brjJSmpM+npS3to-3GiGg@mail.gmail.com
List | pgsql-hackers
On Fri, 31 Jan 2025 at 18:23, James Hunter <james.hunter.pg@gmail.com> wrote:
>
> On Wed, Jan 29, 2025 at 11:49 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > Hi,
> >
> > Some time ago I noticed that every buffer table entry is quite large,
> > at 40 bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4
> > bytes are padding), 20 bytes of BufferTag, and 4 bytes for the offset
> > into the shared buffers array, with generally 8 more bytes used for
> > the bucket pointers. (32-bit systems: 32 (+4) bytes)
> >
> > Does anyone know why we must have the buffer tag in the buffer table?
> > It seems to me we can follow the offset pointer into the shared
> > BufferDesc array whenever we find out we need to compare the tags (as
> > opposed to just the hash, which is stored and present in HASHELEMENT).
> > If we decide to just follow the pointer, we can immediately shave 16
> > bytes (40%) off the lookup table's per-element size, or 24 if we pack
> > the 4-byte shared buffer offset into the unused bytes in HASHELEMENT,
> > reducing the memory usage of that hash table by ~50%:

...

> So every buffer table entry is 40 bytes, and every buffer itself is 8
> KB. Also, every BufferDesc is padded to 64 bytes (on 64-bit systems).
> So buffer table entry is < 0.5% of total buffer space.

Correct. But note that most buffers generally aren't being actively
accessed, and thus aren't (fully) present in CPU caches.

> I assume this means that the benefit of your patch doesn't come from
> memory savings, but maybe from spatial locality? As in, saving 0.5% of
> memory doesn't seem like a huge win by itself, but maybe the hash table
> will see fewer cache misses, since the entries are smaller, so they'll
> be packed closer together. Now you can fit 2.7 hash entries on a cache
> line, instead of 1.6.

Yes - though it's actually 4 entries per cache line: while the buckets
array uses space proportional to the total number of entries, we won't
access that bucket array again once we've used it, so any subsequent
memory accesses are 2.5x more likely to hit a cache line, compared to
what's currently on master.

> Except, to check the buffer tag itself, now you have to dereference
> the offset pointer into the shared BufferDesc array. And every
> BufferDesc is padded to 64 bytes (on 64-bit systems), a cache line.
> So, now, instead of having the buffer tag adjacent to the hash entry,
> you have a memory stall.

Correct - though that only happens for hits and hash collisions. Hash
collisions are considered rare (assuming linear hashing, it'd be
1/2^16), and for hits you would load that buffer tag anyway, so the
cost is a potential cache miss _while holding the buftable lock_, be it
in shared or exclusive mode (most likely: shared, as we don't do blind
updates to the hash table).

> If the buffer tags match (i.e., no collision), this is still a
> possible win, because you'd have to load the BufferDesc anyway (since
> this is the buffer you're looking for).

Exactly.
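For reference, this is roughly what the layouts under discussion look
like. HASHELEMENT (utils/hsearch.h) and BufferLookupEnt (buf_table.c)
are the existing structs; the "packed" struct is only an illustration
of the idea, its name and exact layout aren't taken from the patch:

/* Today (64-bit): a dynahash entry is a HASHELEMENT header followed by
 * the entry proper, here BufferLookupEnt => 16 + 24 = 40 bytes, plus
 * the per-entry share of the bucket pointer array. */
typedef struct HASHELEMENT
{
    struct HASHELEMENT *link;   /* link to next entry in same bucket */
    uint32      hashvalue;      /* hash function result for this entry */
    /* 4 bytes of trailing padding on 64-bit systems */
} HASHELEMENT;

typedef struct                  /* entry for buffer lookup hashtable */
{
    BufferTag   key;            /* tag of a disk page; 20 bytes */
    int         id;             /* associated buffer ID */
} BufferLookupEnt;

/* Sketch of the packed variant (illustrative only): drop the tag and
 * store the buffer ID in what used to be padding.  The tag is fetched
 * from the BufferDesc array whenever it actually needs to be compared
 * (hits and hash collisions).  16 bytes instead of 40. */
typedef struct PackedBufTableEntry
{
    struct PackedBufTableEntry *link;   /* next entry in same bucket */
    uint32      hashvalue;              /* hash of the buffer tag */
    int         id;                     /* buffer ID, in the old padding */
} PackedBufTableEntry;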
> > Does anyone have an idea on how to best benchmark this kind of patch,
> > apart from "running pgbench"? Other ideas on how to improve this?
> > Specific concerns?
>
> What's the expected advantage of reducing the buffer table's entry
> size? Better CPU cache usage when there's no collision, but worse in
> the less-likely case where there is a collision?

Yes, plus a reduced overhead of defining a large shared_buffers. Over
in [0] people discuss changing the size of shared_buffers on-the-fly.
When we have that capability, the overhead of allocating large shared
buffers can become a meaningful fraction of total shared memory in
reduced-shared_buffers systems: currently, we have ~144 bytes per
shared_buffers entry in shared memory allocations, excluding the page
itself. This means that if someone wants to be able to scale their
buffers with a factor 1:100 min:max, the size of those shared buffers'
metadata allocations is 1.7x as large as the size of the buffers in its
smallest configuration. [^1] (A small back-of-the-envelope check of
that number is included after the references below.) While it is
currently definitely not reasonable to expect 100x scaling, I think
reducing this overhead factor is going to be very useful in the long
run.

See also [2], where cache misses in dynahash are probably near 25% of
the total workload - by reducing the size of these structs we can
increase the amount of data we can fit in the same amount of buffers,
thus increasing performance for some cache-size limited workloads that
hit the lookup tables.

> What I mean is, regarding benchmarks: what's the best case scenario
> for this kind of patch, and what sort of performance difference would
> you expect to see?

I assume that for the provided patch:
- there will be no (or minimal) regressions, and improved performance
  when the buffer mapping table now fits in L2/L3 caches (if it
  previously didn't).
- there can be some regressions if the additional offset calculation is
  performance-critical.
- there can be some regressions if an additional cache miss during
  redirection while holding the lock is performance-critical.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/message-id/flat/CA%2BTgmoaGCFPhMjz7veJOeef30%3DKdpOxgywcLwNbr-Gny-mXwcg%40mail.gmail.com

[^1] I make the assumption that when we resize the storage of pages of
shared_buffers, we don't also resize the buffer mapping table and the
BufferDesc array, as blocking processes from accessing page data based
on buffer descriptors' state is easier than totally rewiring all
systems that currently assume constant-size bufferdescs. With effort it
might be possible to resize these, but I don't see an easy path forward
for that, while the 8KB page of every shared_buffers slot is relatively
easier to remove from memory.

[2] https://www.postgresql.org/message-id/flat/CA%2BCZih5R_ZMjcuMHM3Ub10vTNDLEBjW8VsO3h-y2WKb%3DoK4fWw%40mail.gmail.com
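PS: the back-of-the-envelope check for the 1.7x figure mentioned above.
The 144 bytes/buffer is the rough per-buffer metadata estimate from
this mail, not a number measured on a running server:

/* Metadata is allocated for the maximum buffer count, while only
 * 1/scale of the 8KB pages are kept in the smallest configuration. */
#include <stdio.h>

int
main(void)
{
    const double meta_per_buffer = 144.0;   /* bytes/buffer of metadata, estimate */
    const double page_size = 8192.0;        /* BLCKSZ */
    const double scale = 100.0;             /* min:max = 1:100 */

    /* metadata sized for 100x buffers vs. pages kept at 1x */
    printf("%.2fx\n", scale * meta_per_buffer / page_size);    /* prints 1.76x */
    return 0;
}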