Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAFBsxsEJ7PhhE6dKXtVm3YnT=T7wgNWbOEEXRXm5wDMyKBG2Yw@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09 ).
> >
>
> Thank you for making progress on this. I agree with these directions
> overall. I have some comments and questions:

Glad to hear it and thanks for looking!

> > * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.)
>
> Within the size class, we just alloc a new node of lower size class
> and do memcpy(). I guess it will be almost same as what we do for
> growing.

Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward. I'm thinking the hard-coded numbers like this:

.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),

...may be better with a #defined symbol that can also be used elsewhere.

> I don't think
> shrinking class-3 to class-1 makes sense.

Agreed. The smallest kind should just be freed when empty.

> > Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.
>
> Yes, but it also means that we use pointer-sized value anyway even if
> the value size is less than that, which wastes the memory, no?

At a low-level, that makes sense, but I've found an interesting global effect showing the opposite: _less_ memory, which may compensate:

psql -c "select * from bench_search_random_nodes(1*1000*1000)"
num_keys = 992660

(using a low enough number that the experimental change n125->n63 doesn't affect anything)
height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025

v31:
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
      47800768 |     253 |       134

(unreleased code "similar" to v33, but among other things restores the separate "extend down" function)
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
      42926048 |     221 |       127

I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are embedded here) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that additional size classes have both runtime costs as well as benefits. We need to have a balance.)

So, I'm inclined to think the only reason to prefer "multi-value leaves" is if 1) the value type is _bigger_ than a pointer 2) there is no convenient abbreviation (like tid bitmaps have) and 3) the use case really needs to avoid another memory access. Under those circumstances, though, the new code plus lazy expansion etc might suit and be easier to maintain. That said, I've mostly left alone the "leaf" types and functions, as well as added some detritus like "const bool = false;". It would look a *lot* nicer if we gave up on multi-value leaves entirely, but there's no rush and I don't want to close that door entirely just yet.

> > What I've shared here could work (in principal, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).
>
> Sounds good.

The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now thinking a much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values" can use the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit for a tag "value or pointer-to-leaf" would have a much smaller code footprint. That also has a curious side-effect for TID offsets: They are one-based so reserving the zero bit would actually simplify things: getting rid of the +1/-1 logic when converting bits to/from offsets.

In addition, without a new bitmap, the smallest node can actually be up to a node5 with no struct padding, with a node2 as a subclass. (Those numbers coincidentally were also one scenario in the paper, when calculating worst-case memory usage). That's worth considering.

> > I've started in this patchset to refer to the node kinds as "4/16/48/256", regardless of their actual fanout.

> If we want to use the node kinds used in the paper, I think we should
> change the number in RT_NODE_KIND_X too.

Oh absolutely, this is nowhere near ready for cosmetic review :-)

--
John Naylor
EDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Implement generalized sub routine find_in_log for tap test
Next
From: "Wei Wang (Fujitsu)"
Date:
Subject: RE: Support logical replication of DDLs