Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAFBsxsFyWLxweHVDtKb7otOCR4XdQGYR4b+9svxpVFnJs08BmQ@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
I wrote:
> the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.

Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09 ).

0001

This combines a few concepts that I didn't bother separating out after the fact:
- Split insert_impl.h into multiple functions for improved readability and maintainability.
- Use single-value leaves as the basis for storing values, with the goal to get to "combined pointer-value slots" for efficiency and flexibility.
- With the latter in mind, searching the child within a node now returns the address of the slot. This allows the same interface whether the slot contains a child pointer or a value.
- Starting with RT_SET, start turning some iterative algorithms into recursive ones. This is a more natural way to traverse a tree structure, and we already see an advantage: Previously when growing a node, we searched within the parent to update its reference to the new node, because we didn't know the slot we descended from. Now we can simply update a single variable.
- Since we recursively pass the "shift" down the stack, it doesn't have to be stored in any node -- only the "top-level" start shift is stored in the tree control struct. This was easy to code since the node's shift value was hardly ever accessed anyway! The node header shrinks from 5 bytes to 4.

0002

Back in v15, we tried keeping DSA/local pointers as members of a struct. I did not like the result, but still thought it was a good idea. RT_DELETE is a complex function and I didn't want to try rewriting it without a pointer abstraction, so I've resurrected this idea, but in a simpler, less intrusive way. A key difference from v15 is using a union type for the non-shmem case.

0004

Rewrite RT_DELETE using recursion. I find this simpler than the previous open-coded stack.

0005-06

Deletion has an inefficiency: One function searches for the child to see if it's there, then another function searches for it again to delete it. Since 0001, a successful child search returns the address of the slot, so we can save it. For the two smaller "linear search" node kinds we can then use a single subtraction to compute the chunk/slot index for deletion. Also, split RT_NODE_DELETE_INNER into separate functions, for a similar reason as the insert case in 0001.

0007

Anticipate node shrinking: If only one node-kind needs to be freed, we can move a branch to that one code path, rather than every place where RT_FREE is inlined.

0009

Teach node256 how to shrink *. Since we know the number of children in a node256 can't possibly be zero, we can use uint8 to store the count and interpret an overflow to zero as 256 for this node. The node header shrinks from 4 bytes to 3.

* Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.)

0010

Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.

What I've shared here could work (in principal, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).

I've started in this patchset to refer to the node kinds as "4/16/48/256", regardless of their actual fanout. This is for readability (by matching the language in the paper) and maintainability (should *not* ever change again). The size classes (including multiple classes per kind) could be determined by macros and #ifdef's. For example, in non-SIMD architectures, it's likely slow to search an array of 32 key chunks, so in that case the compiler should choose size classes similar to these four nominal kinds.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

pgsql-hackers by date:

Previous
From: "Jonathan S. Katz"
Date:
Subject: Re: PG 16 draft release notes ready
Next
From: Michael Paquier
Date:
Subject: Re: RFI: Extending the TOAST Pointer