Re: RFC: Packing the buffer lookup table - Mailing list pgsql-hackers
From | Matthias van de Meent
---|---
Subject | Re: RFC: Packing the buffer lookup table
Date |
Msg-id | CAEze2WiNEjjx7PWz8WJwkB=BV9RQ+brjJSmpM+npS3to-3GiGg@mail.gmail.com
List | pgsql-hackers
On Fri, 31 Jan 2025 at 18:23, James Hunter <james.hunter.pg@gmail.com> wrote:
>
> On Wed, Jan 29, 2025 at 11:49 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > Hi,
> >
> > Some time ago I noticed that every buffer table entry is quite large,
> > at 40 bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4
> > bytes are padding), 20 bytes of BufferTag, and 4 bytes for the offset
> > into the shared buffers array, with generally 8 more bytes used for
> > the bucket pointers. (32-bit systems: 32 (+4) bytes)
> >
> > Does anyone know why we must have the buffer tag in the buffer table?
> > It seems to me we can follow the offset pointer into the shared
> > BufferDesc array whenever we find out we need to compare the tags (as
> > opposed to just the hash, which is stored and present in HASHELEMENT).
> > If we decide to just follow the pointer, we can immediately shave 16
> > bytes (40%) off the lookup table's per-element size, or 24 if we pack
> > the 4-byte shared buffer offset into the unused bytes in HASHELEMENT,
> > reducing the memory usage of that hash table by ~50%:

...

> So every buffer table entry is 40 bytes, and every buffer itself is 8
> KB. Also, every BufferDesc is padded to 64 bytes (on 64-bit systems).
> So buffer table entry is < 0.5% of total buffer space.

Correct. But note that most buffers generally aren't being actively
accessed, and thus aren't (fully) present in CPU caches.

> I assume this means that the benefit of your patch doesn't come from
> memory savings, but maybe from spatial locality? As in, saving 0.5% of
> memory doesn't seem like a huge win by itself, but maybe the hash table
> will see fewer cache misses, since the entries are smaller, so they'll
> be packed closer together. Now you can fit 2.7 hash entries on a cache
> line, instead of 1.6.

Yes - though it's actually 4 entries per cache line: while the buckets
array uses space proportional to the total number of entries, we won't
access that bucket array again once we've used it, so any subsequent
memory accesses are 2.5x more likely to hit a cache line, compared to
what's currently on master.

> Except, to check the buffer tag itself, now you have to dereference
> the offset pointer into the shared BufferDesc array. And every
> BufferDesc is padded to 64 bytes (on 64-bit systems), a cache line.
> So, now, instead of having the buffer tag adjacent to the hash entry,
> you have a memory stall.

Correct - though that only happens for hits and hash collisions. Hash
collisions are considered rare (assuming linear hashing, it'd be
1/2^16), and for hits you would load that buffer tag anyway, so the
cost is a potential cache miss _while holding the buftable lock_, be it
in shared or exclusive mode (most likely: shared, as we don't do blind
updates to the hash table).

> If the buffer tags match (i.e., no collision), this is still a
> possible win, because you'd have to load the BufferDesc anyway (since
> this is the buffer you're looking for).

Exactly.
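For reference, this is roughly what the layouts under discussion look
like. HASHELEMENT (utils/hsearch.h) and BufferLookupEnt (buf_table.c)
are the existing structs; the "packed" struct is only an illustration
of the idea, its name and exact layout aren't taken from the patch:

/* Today (64-bit): a dynahash entry is a HASHELEMENT header followed by
 * the entry proper, here BufferLookupEnt => 16 + 24 = 40 bytes, plus
 * the per-entry share of the bucket pointer array. */
typedef struct HASHELEMENT
{
    struct HASHELEMENT *link;   /* link to next entry in same bucket */
    uint32      hashvalue;      /* hash function result for this entry */
    /* 4 bytes of trailing padding on 64-bit systems */
} HASHELEMENT;

typedef struct                  /* entry for buffer lookup hashtable */
{
    BufferTag   key;            /* tag of a disk page; 20 bytes */
    int         id;             /* associated buffer ID */
} BufferLookupEnt;

/* Sketch of the packed variant (illustrative only): drop the tag and
 * store the buffer ID in what used to be padding.  The tag is fetched
 * from the BufferDesc array whenever it actually needs to be compared
 * (hits and hash collisions).  16 bytes instead of 40. */
typedef struct PackedBufTableEntry
{
    struct PackedBufTableEntry *link;   /* next entry in same bucket */
    uint32      hashvalue;              /* hash of the buffer tag */
    int         id;                     /* buffer ID, in the old padding */
} PackedBufTableEntry;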
> > Does anyone have an idea on how to best benchmark this kind of patch,
> > apart from "running pgbench"? Other ideas on how to improve this?
> > Specific concerns?
>
> What's the expected advantage of reducing the buffer table's entry
> size? Better CPU cache usage when there's no collision, but worse in
> the less-likely case where there is a collision?

Yes, plus a reduced overhead of defining a large shared_buffers. Over
in [0] people discuss changing the size of shared_buffers on-the-fly.
When we have that capability, the overhead of allocating large shared
buffers can become a meaningful fraction of total shared memory in
reduced-shared_buffers systems: currently, we have ~144 bytes per
shared_buffers entry in shared memory allocations, excluding the page
itself. This means that if someone wants to be able to scale their
buffers with a factor 1:100 min:max, the size of those shared buffers'
metadata allocations is 1.7x as large as the size of the buffers in its
smallest configuration. [^1] (A small back-of-the-envelope check of
that number is included after the references below.) While it is
currently definitely not reasonable to expect 100x scaling, I think
reducing this overhead factor is going to be very useful in the long
run.

See also [2], where cache misses in dynahash are probably near 25% of
the total workload - by reducing the size of these structs we can
increase the amount of data we can fit in the same amount of buffers,
thus increasing performance for some cache-size limited workloads that
hit the lookup tables.

> What I mean is, regarding benchmarks: what's the best case scenario
> for this kind of patch, and what sort of performance difference would
> you expect to see?

I assume that for the provided patch:
- there will be no (or minimal) regressions, and improved performance
  when the buffer mapping table now fits in L2/L3 caches (if it
  previously didn't).
- there can be some regressions if the additional offset calculation is
  performance-critical.
- there can be some regressions if an additional cache miss during
  redirection while holding the lock is performance-critical.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/message-id/flat/CA%2BTgmoaGCFPhMjz7veJOeef30%3DKdpOxgywcLwNbr-Gny-mXwcg%40mail.gmail.com

[^1] I make the assumption that when we resize the storage of pages of
shared_buffers, we don't also resize the buffer mapping table and the
BufferDesc array, as blocking processes from accessing page data based
on buffer descriptors' state is easier than totally rewiring all
systems that currently assume constant-size bufferdescs. With effort it
might be possible to resize these, but I don't see an easy path forward
for that, while the 8KB page of every shared_buffers slot is relatively
easier to remove from memory.

[2] https://www.postgresql.org/message-id/flat/CA%2BCZih5R_ZMjcuMHM3Ub10vTNDLEBjW8VsO3h-y2WKb%3DoK4fWw%40mail.gmail.com
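PS: the back-of-the-envelope check for the 1.7x figure mentioned above.
The 144 bytes/buffer is the rough per-buffer metadata estimate from
this mail, not a number measured on a running server:

/* Metadata is allocated for the maximum buffer count, while only
 * 1/scale of the 8KB pages are kept in the smallest configuration. */
#include <stdio.h>

int
main(void)
{
    const double meta_per_buffer = 144.0;   /* bytes/buffer of metadata, estimate */
    const double page_size = 8192.0;        /* BLCKSZ */
    const double scale = 100.0;             /* min:max = 1:100 */

    /* metadata sized for 100x buffers vs. pages kept at 1x */
    printf("%.2fx\n", scale * meta_per_buffer / page_size);    /* prints 1.76x */
    return 0;
}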