Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoB+z3+T2EnOBVXok2aeRywrDZDFkwwEaWKZSrFz9s1+mw@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Wed, Oct 26, 2022 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I've attached updated PoC patches for discussion and cfbot. From the
> > previous version, I mainly changed the following things:
> >

Thank you for the comments!

> > * Separate treatment of inner and leaf nodes
>
> Overall, this looks much better!
>
> > * Pack both the node kind and node count to an uint16 value.
>
> For this, I did mention a bitfield earlier as something we "could" do, but it wasn't clear we should. After looking
againat the node types, I must not have thought through this at all. Storing one byte instead of four for the full enum
isa good step, but saving one more byte usually doesn't buy anything because of padding, with a few exceptions like
thisexample: 
>
> node4:   4 +  4           +  4*8 =   40
> node4:   5 +  4+(7)       +  4*8 =   48 bytes
>
> Even there, I'd rather not spend the extra cycles to access the members. And with my idea of decoupling size classes
fromkind, the variable-sized kinds will require another byte to store "capacity". Then, even if the kind gets encoded
ina pointer tag, we'll still have 5 bytes in the base type. So I think we should assume 5 bytes from the start. (Might
be6 temporarily if I work on size decoupling first). 

True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support so currently I'm thinking to pack the node kind and node
capacity classes to uint8.

>
> (Side note, if you have occasion to use bitfields again in the future, C99 has syntactic support for them, so no need
towrite your own shifting/masking code). 

Thanks!

>
> > I've not done SIMD part seriously yet. But overall the performance
> > seems good so far. If we agree with the current approach, I think we
> > can proceed with the verification of decoupling node sizes from node
> > kind. And I'll investigate DSA support.
>
> Sounds good. I have some additional comments about v7, and after these are addressed, we can proceed independently
withthe above two items. Seeing the DSA work will also inform me how invasive pointer tagging will be. There will still
besome performance tuning and cosmetic work, but it's getting closer. 
>

I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.

While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:

node = dsa_get_address(tree->dsa, node_dp);

As shown above, the dsa area used by the shared radix tree is stored
in radix_tree struct, so we can know whether the radix tree is shared
or not by checking (tree->dsa == NULL). That is, if it's shared we use
a pointer to radix tree node as dsa_pointer, and if not we use a
pointer as a backend-local pointer. We don't need to encode something
in a pointer.

> -------------------------
> 0001:
>
> +#ifndef USE_NO_SIMD
> +#include "port/pg_bitutils.h"
> +#endif
>
> Leftover from an earlier version?
>
> +static inline int vector8_find(const Vector8 v, const uint8 c);
> +static inline int vector8_find_ge(const Vector8 v, const uint8 c);
>
> Leftovers, causing compiler warnings. (Also see new variable shadow warning)

Will fix.

>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
> +
> + return r;
> +#endif
>
> As I mentioned a couple versions ago, this style is really awkward, and potential non-SIMD callers will be better off
writingtheir own byte-wise loop rather than using this API. Especially since the "min" function exists only as a
workaroundfor lack of unsigned comparison in (at least) SSE2. There is one existing function in this file with that
idiomfor non-assert code (for completeness), but even there, inputs of current interest to us use the uint64 algorithm. 

Agreed. Will remove non-SIMD code.

>
> 0002:
>
> + /* XXX: should not to use vector8_highbit_mask */
> + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
>
> Hmm?

It's my outdated memo, will remove.

>
> +/*
> + * Return index of the first element in chunks in the given node that is greater
> + * than or equal to 'key'.  Return -1 if there is no such element.
> + */
> +static inline int
> +node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
>
> The caller must now have logic for inserting at the end:
>
> + int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
> + int16 count = NODE_GET_COUNT(n32);
> +
> + if (insertpos < 0)
> + insertpos = count; /* insert to the tail */
>
> It would be a bit more clear if node_*_search_ge() always returns the position we need (see the prototype for
example).In fact, these functions are probably better named node*_get_insertpos(). 

Agreed.

>
> + if (likely(NODE_HAS_FREE_SLOT(n128)))
> + {
> + node_inner_128_insert(n128, chunk, child);
> + break;
> + }
> +
> + /* grow node from 128 to 256 */
>
> We want all the node-growing code to be pushed down to the bottom so that all branches of the hot path are close
together.This provides better locality for the CPU frontend. Looking at the assembly, the above doesn't have the
desiredeffect, so we need to write like this (also see prototype): 
>
> if (unlikely( ! has-free-slot))
>   grow-node;
> else
> {
>   ...;
>   break;
> }
> /* FALLTHROUGH */

Good point. Will change.

>
> + /* Descend the tree until a leaf node */
> + while (shift >= 0)
> + {
> +   rt_node    *child;
> +
> +   if (NODE_IS_LEAF(node))
> +     break;
> +
> +   if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
> +     child = rt_node_add_new_child(tree, parent, node, key);
> +
> +   Assert(child);
> +
> +   parent = node;
> +   node = child;
> +   shift -= RT_NODE_SPAN;
> + }
>
> Note that if we have to call rt_node_add_new_child(), each successive loop iteration must search it and find nothing
there(the prototype had a separate function to handle this). Maybe it's not that critical yet, but something to keep in
mindas we proceed. Maybe a comment about it to remind us. 

Agreed. Currently rt_extend() is used to add upper nodes but probably
we need another function to add lower nodes for this case.

>
> + /* there is no key to delete */
> + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
> +   return false;
> +
> + /* Update the statistics */
> + tree->num_keys--;
> +
> + /*
> +  * Delete the key from the leaf node and recursively delete the key in
> +  * inner nodes if necessary.
> +  */
> + Assert(NODE_IS_LEAF(stack[level]));
> + while (level >= 0)
> + {
> +   rt_node    *node = stack[level--];
> +
> +   if (NODE_IS_LEAF(node))
> +     rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
> +   else
> +     rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
> +
> +   /* If the node didn't become empty, we stop deleting the key */
> +   if (!NODE_IS_EMPTY(node))
> +     break;
> +
> +   /* The node became empty */
> +   rt_free_node(tree, node);
> + }
>
> Here we call rt_node_search_leaf() twice -- once to check for existence, and once to delete. All three search calls
areinlined, so this wastes space. Let's try to delete the leaf, return if not found, otherwise handle the leaf
bookkeeppingand loop over the inner nodes. This might require some duplication of code. 

Agreed.

>
> +ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
>
> Spelling

WIll fix.

>
> +static inline void
> +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> +             uint8 *dst_chunks, rt_node **dst_children, int count)
> +{
> + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> +}
>
> gcc generates better code with something like this (but not hard-coded) at the top:
>
>     if (count > 4)
>         pg_unreachable();

Agreed.

>
> This would have to change when we implement shrinking of nodes, but might still be useful.
>
> + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
> +   return false;
> +
> + return true;
>
> Maybe just "return rt_node_search_leaf(...)" ?

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: GUC values - recommended way to declare the C variables?
Next
From: Justin Pryzby
Date:
Subject: Re: GUC values - recommended way to declare the C variables?