Re: RFC: Packing the buffer lookup table - Mailing list pgsql-hackers
From: Matthias van de Meent
Subject: Re: RFC: Packing the buffer lookup table
Date:
Msg-id: CAEze2WgLtdFEszUK_-o6+r_X8Op=_caCQGiXyKvLOeVgWkcG9Q@mail.gmail.com
List: pgsql-hackers
On Wed, 5 Feb 2025 at 02:14, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-01-30 08:48:56 +0100, Matthias van de Meent wrote:
> > Some time ago I noticed that every buffer table entry is quite large, at
> > 40 bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes
> > are padding), 20 bytes of BufferTag, and 4 bytes for the offset into the
> > shared buffers array, with generally 8 more bytes used for the bucket
> > pointers. (32-bit systems: 32 (+4) bytes)
> >
> > Does anyone know why we must have the buffer tag in the buffer table?
>
> It turns out to actually substantially improve the CPU cache hit ratio on
> concurrent workloads. The BufferDesc is obviously frequently modified.
> Each modification (pin, lock) will pull the cacheline into modified state
> within a single core, causing higher latency when accessing that data on
> other cores. That's obviously not great for a crucial hashtable... I think
> it mainly matters for things like inner index pages etc.
>
> It turns out that these misses due to dirtying already cost us rather
> dearly when running read-heavy workloads on bigger machines, mainly due to
> nbtree code doing things like BufferGetBlockNumber().

That is something I hadn't thought about, but indeed, you're right that
this wouldn't be great.

> It'd be interesting to benchmark how a separate, more densely packed
> BufferTag array, shared by the hashtable and the BufferDesc itself, would
> perform. Obviously that'd be a somewhat painful change.

Such a patch is actually not that bad, as surprisingly few files actually
touch BufferDesc (let alone BufferDesc->tag - though the number of places
with changes is still 100+). I've prototyped it, and it removes another 12
bytes from the overhead of each buffer (assuming we want to pack BufferDesc
at 32 bytes, as that's possible).

> > Does anyone have an idea on how to best benchmark this kind of patch,
> > apart from "running pgbench"? Other ideas on how to improve this?
> > Specific concerns?
>
> I'd recommend benchmarking at least the following workloads, all fully
> shared buffer cache resident: [...]

Thanks for the suggestions!

> It's unfortunately fairly important to test these both with a single
> client *and* a large number of clients on a multi-socket server. The
> latter makes cache miss latency much more visible.
>
> It might be worth looking at perf c2c profiles for before/after, I'd
> expect it to change noticeably. Might also be worth looking at perf stat
> for cache misses, hitm, etc.

Hmm. I'll see if I can get some hardware to test this.

FYI, I've pushed a newer version of the 'newhash' approach to GitHub, at
[0]. It extracts the buffer tags from the BufferDesc into their own array
to reduce the problems with false sharing due to pins, and has some more
code that tries to forcefully increase the locality of hash elements.

There are still a few more potential changes that could increase the cache
locality of hash elements with the buckets that store their data, but I
have no immediate plans for those.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://github.com/MMeent/postgres/tree/feat/new-buftable-hash
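P.S. For concreteness, the layout change discussed above amounts to roughly
the following. This is a minimal sketch, not the code from [0]; the struct
and field names other than BufferTag's members are hypothetical stand-ins
for the real BufferDesc contents:

    #include <stdint.h>

    typedef struct BufferTag        /* 20 bytes, identifies a disk block */
    {
        uint32_t    spcOid;         /* tablespace OID */
        uint32_t    dbOid;          /* database OID */
        uint32_t    relNumber;      /* relation file number */
        uint32_t    forkNum;        /* main/fsm/vm/init fork */
        uint32_t    blockNum;       /* block number within the fork */
    } BufferTag;

    /*
     * Hypothetical descriptor packed to 32 bytes, with the tag removed.
     * Pins and locks dirty only this cache line; backends that merely
     * read the tag no longer share a line with that write traffic.
     */
    typedef struct PackedBufferDesc
    {
        uint32_t    state;          /* refcount, usage count, flag bits */
        int32_t     wait_backend;   /* backend waiting for pin cleanup */
        int32_t     free_next;      /* freelist link */
        uint32_t    content_lock;   /* content lock state */
        char        pad[16];        /* remaining space up to 32 bytes */
    } PackedBufferDesc;

    /*
     * The tags live in their own dense, read-mostly array, indexed by
     * buffer id and shared between the lookup table and the descriptors,
     * so neither needs its own 20-byte copy per buffer.
     */
    BufferTag        *BufferTagArray;       /* NBuffers entries */
    PackedBufferDesc *BufferDescriptors;    /* NBuffers entries */

    static inline BufferTag *
    BufferGetTag(int buf_id)
    {
        return &BufferTagArray[buf_id];
    }

With this split, the lookup table's entries can map a tag index to a buffer
id without embedding the tag itself, which is where the per-buffer savings
mentioned above come from.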