Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAFBsxsHYJxc=D9Bnep2N++44KHuUFg86mmPzLxs+4tc6=fWfYA@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Wed, Sep 28, 2022 at 1:18 PM I wrote:

> Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size.

While the most important challenge right now is how to best represent and organize the shared memory case, I wanted to get the above idea working and out of the way, to be saved for a future time. I've attached a rough implementation (applies on top of v9 0003) that splits node32 into 2 size classes. They both share the exact same base data type and hence the same search/set code, so the number of "kind"s is still four, but here there are five "size classes", so a new case in the "unlikely" node-growing path. The smaller instance of node32 is a "node15", because that's currently 160 bytes, corresponding to one of the DSA size classes. This idea can be applied to any other node except the max size, as we see fit. (Adding a singleton size class would bring it back in line with the prototype, at least as far as memory consumption.)

One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.

In the course of working on this, I encountered a pain point. Since it's impossible to repalloc in slab, we have to do alloc/copy/free ourselves. That's fine, but the current coding makes too many assumptions about the use cases: rt_alloc_node and rt_copy_node are too entangled with each other and do too much work unrelated to what the names imply. I seem to remember an earlier version had something like rt_node_copy_common that did only...copying. That was much easier to reason about. In 0002 I resorted to doing my own allocation to show what I really want to do, because the new use case doesn't need zeroing and setting values. It only needs to...allocate (and increase the stats counter if built that way).

Future optimization work while I'm thinking of it: rt_alloc_node should be always-inlined and the memset done separately (i.e. not *AllocZero). That way the compiler should be able generate more efficient zeroing code for smaller nodes. I'll test the numbers on this sometime in the future.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: contrib: auth_delay module
Next
From: Pavel Stehule
Date:
Subject: Re: Schema variables - new implementation for Postgres 15