
From: Masahiko Sawada
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CAD21AoBKFiTP5YkTjP=r8LGcaUDnX+x48UE9Ta7=njKmip9s0Q@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers
On Fri, Nov 25, 2022 at 5:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > So it seems that there are two candidates for the rt_node structure: (1)
> > all nodes except for node256 are variable-size nodes and use pointer
> > tagging, and (2) node32 and node128 are variable-size nodes and do
> > not use pointer tagging (the fanout field is part of only these two
> > nodes). rt_node can be 5 bytes in both cases. But before going to this
> > step, I started to verify the idea of variable-size nodes by using a
> > 6-byte rt_node. We can adjust the node kinds and node classes later.
>
> First, I'm glad you picked up the size class concept and expanded it. (I have some comments about some internal APIs
> below.)
>
> Let's leave the pointer tagging piece out until the main functionality is committed. We have all the prerequisites in
> place, except for a benchmark random enough to demonstrate benefit. I'm still not quite satisfied with how the shared
> memory coding looked, and that is the only sticky problem we still have, IMO. The rest is "just work".
>
> That said, (1) and (2) above are still relevant -- variable sizing any given node is optional, and we can refine as
> needed.
>
> > Overall, the idea of variable-sized nodes is good: smaller size
> > without losing search performance.
>
> Good.
>
> > I'm going to check the load
> > performance as well.
>
> Part of that is this, which gets called a lot more now, when node1 expands:
>
> + if (inner)
> + newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
> + rt_node_kind_info[kind].inner_size);
> + else
> + newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
> + rt_node_kind_info[kind].leaf_size);
>
> Since memset for expanding size class is now handled separately, these can use the non-zeroing versions. When
> compiling MemoryContextAllocZero, the compiler has no idea how big the size is, so it assumes the worst and optimizes
> for large sizes. On x86-64, that means using "rep stos", which calls microcode found in the CPU's ROM. This is slow for
> small sizes. The "init" function should be always inline with const parameters where possible. That way, memset can
> compile to a single instruction for the smallest node kind. (More on alloc/init below)

Right. I forgot to update it.
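
To be concrete, here is a minimal sketch of the direction I understand
you to mean (the names follow the v11 patch, but the exact signatures
are my assumption):

/* Allocate a node without zeroing it; initialization is done separately */
static rt_node *
rt_alloc_node(radix_tree *tree, uint8 kind, bool inner)
{
    if (inner)
        return (rt_node *) MemoryContextAlloc(tree->inner_slabs[kind],
                                              rt_node_kind_info[kind].inner_size);
    else
        return (rt_node *) MemoryContextAlloc(tree->leaf_slabs[kind],
                                              rt_node_kind_info[kind].leaf_size);
}

/*
 * Zero and stamp a freshly allocated node. If 'kind' and 'inner' are
 * compile-time constants at the call site and rt_node_kind_info is
 * const, the memset size is a known constant, so the compiler can emit
 * a few stores instead of a call to memset (or "rep stos"). Per the
 * discussion below, the initial fanout (with a branch to force 0 for
 * node256) would also be set here.
 */
static inline void
rt_init_node(rt_node *node, uint8 kind, bool inner)
{
    if (inner)
        memset(node, 0, rt_node_kind_info[kind].inner_size);
    else
        memset(node, 0, rt_node_kind_info[kind].leaf_size);

    node->kind = kind;
}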

>
> Note, there is a wrinkle: As currently written inner_node128 searches the child pointers for NULL when inserting, so
> when expanding from partial to full size class, the new node must be zeroed (Worth fixing in the short term. I thought
> of this while writing the proof-of-concept for size classes, but didn't mention it.) Medium term, rather than
> special-casing this, I actually want to rewrite the inner-node128 to be more similar to the leaf, with an "isset" array,
> but accessed and tested differently. I guarantee it's *really* slow now to load (maybe somewhat true even for leaves),
> but I'll leave the details for later.

Agreed. I'll start with zeroing out the node when expanding from a
partial to a full size class.
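
For the medium-term "isset" direction, if the inner node128 carries a
bitmap like the leaf already does, insertion could test a bit instead
of scanning the children array for NULL. A rough sketch, with
hypothetical names (bitmapword and BITS_PER_BITMAPWORD as in
nodes/bitmapset.h):

/* does the slot hold a child? (hypothetical) */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
    return (node->isset[slot / BITS_PER_BITMAPWORD] &
            ((bitmapword) 1 << (slot % BITS_PER_BITMAPWORD))) != 0;
}

/* set the child and mark the slot used (hypothetical) */
static inline void
node_inner_128_set(rt_node_inner_128 *node, uint8 slot, rt_node *child)
{
    node->isset[slot / BITS_PER_BITMAPWORD] |=
        ((bitmapword) 1 << (slot % BITS_PER_BITMAPWORD));
    node->children[slot] = child;
}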

> Regarding node128 leaf, note that it's slightly larger than a DSA size class, and we can trim it to fit:
>
> node61:  6 + 256+(2) +16 +  61*8 =  768
> node125: 6 + 256+(2) +16 + 125*8 = 1280

Agreed, changed.
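
(IIUC 768 and 1280 are exact entries in dsa_size_classes in dsa.c, so
with 61 and 125 slots the nodes fit their DSA allocations with no
wasted space.)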

>
> > I've attached the patches I used for the verification. I don't include
> > patches for pointer tagging, DSA support, and vacuum integration since
> > I'm investigating the issue on cfbot that Andres reported. Also, I've
> > modified tests to improve the test coverage.
>
> Sounds good. For v12, I think size classes have proven themselves, so v11's 0002/4/5 can be squashed. Plus, some
> additional comments:
>
> +/* Return a new and initialized node */
> +static rt_node *
> +rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
> +{
> + rt_node *newnode;
> +
> + newnode = rt_alloc_node(tree, kind, inner);
> + rt_init_node(newnode, kind, shift, chunk, inner);
> +
> + return newnode;
> +}
>
> I don't see the point of a function that just calls two functions.

Removed.

>
> +/*
> + * Create a new node with 'new_kind' and the same shift, chunk, and
> + * count of 'node'.
> + */
> +static rt_node *
> +rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
> +{
> + rt_node    *newnode;
> +
> + newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
> + node->shift > 0);
> + newnode->count = node->count;
> +
> + return newnode;
> +}
>
> This, in turn, just calls a function that does _almost_ everything, and additionally must set one member. This
> function should really be alloc-node + init-node + copy-common, where copy-common is like in the prototype:
> + newnode->node_shift = oldnode->node_shift;
> + newnode->node_chunk = oldnode->node_chunk;
> + newnode->count = oldnode->count;
>
> And init-node should really be just memset + set kind + set initial fanout. It has no business touching "shift" and
> "chunk". The callers rt_new_root, rt_set_extend, and rt_extend set some values of their own anyway, so let them set
> those, too -- it might even improve readability.
>
> -       if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
> +       if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))

Agreed.
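
So the grow path would end up looking something like this (a sketch;
rt_copy_node is the "copy-common" step, and the exact signatures are
my assumption):

/* Copy the common header fields from oldnode to newnode */
static inline void
rt_copy_node(rt_node *newnode, rt_node *oldnode)
{
    newnode->shift = oldnode->shift;
    newnode->chunk = oldnode->chunk;
    newnode->count = oldnode->count;
}

static rt_node *
rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
{
    rt_node    *newnode;
    bool        inner = node->shift > 0;

    newnode = rt_alloc_node(tree, new_kind, inner);
    rt_init_node(newnode, new_kind, inner);
    rt_copy_node(newnode, node);

    return newnode;
}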

>
> This macro doesn't really improve readability -- it obscures what is being tested, and the name implies the "else"
> branch means "node doesn't need to grow class", which is false. If we want to simplify expressions in this block, I
> think it'd be more effective to improve the lines that follow:
>
> + memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
> + new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
>
> Maybe we can have const variables old_size and new_fanout to break out the array lookup? While I'm thinking of it,
> these arrays should be const so the compiler can avoid runtime lookups. Speaking of...
>
> +/* Copy both chunks and children/values arrays */
> +static inline void
> +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> +  uint8 *dst_chunks, rt_node **dst_children, int count)
> +{
> + /* For better code generation */
> + if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
> + pg_unreachable();
> +
> + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> +}
>
> When I looked at this earlier, I somehow didn't go far enough -- why are we passing the runtime count in the first
> place? This function can only be called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last parameter to
> memcpy should evaluate to a compile-time constant, right? Even when we add node shrinking in the future, the constant
> should be correct, IIUC?

Right. We don't need to pass count to these functions.
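
With rt_size_class_info declared const, the copy can then be, roughly:

/* Copy chunks and children arrays of a full node4 (sketch) */
static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
                          uint8 *dst_chunks, rt_node **dst_children)
{
    const int   fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;

    /* both sizes are compile-time constants */
    memcpy(dst_chunks, src_chunks, sizeof(uint8) * fanout);
    memcpy(dst_children, src_children, sizeof(rt_node *) * fanout);
}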

>
> - .fanout = 256,
> + /* technically it's 256, but we can't store that in a uint8,
> +  and this is the max size class so it will never grow */
> + .fanout = 0,
>
> - Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
> + Assert(((rt_node *) n256)->fanout == 0);
> + Assert(chunk_exists || ((rt_node *) n256)->count < 256);
>
> These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for
> fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for
> node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.

Since every node has a fanout field regardless of whether it is
fixed-size or variable-size, node256 is the only special case, where
the fanout stored in the node doesn't match its actual fanout. So if
we want to have two versions of NODE_HAS_FREE_SLOT, I think we can
have one for node256 and one for all other classes. Thoughts? In your
idea, did you mean something like the following for fixed-sized
nodes?

#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
  ((node)->base.n.count < rt_size_class_info[class].fanout)
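
And for variable-sized nodes, I imagine the counterpart would compare
against the fanout stored in the node itself, something like:

#define VAR_NODE_HAS_FREE_SLOT(node) \
  ((node)->base.n.count < (node)->base.n.fanout)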

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


