Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoC8A0bUU+q7-k9hF7Dop3HuOUZR0x2X853oApEWErxU9A@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Andres Freund <andres@anarazel.de>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> > new file mode 100644
> > index 0000000000..bf87f932fd
> > --- /dev/null
> > +++ b/src/backend/lib/radixtree.c
> > @@ -0,0 +1,1763 @@
> > +/*-------------------------------------------------------------------------
> > + *
> > + * radixtree.c
> > + *           Implementation for adaptive radix tree.
> > + *
> > + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
> > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
> > + * Neumann, 2013.
> > + *
> > + * There are some differences from the proposed implementation.  For instance,
> > + * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
> > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
> > + * Also, there is no support for path compression and lazy path expansion. The
> > + * radix tree supports fixed length of the key so we don't expect the tree level
> > + * wouldn't be high.
>
> I think we're going to need path compression at some point, fwiw. I'd bet on
> it being beneficial even for the tid case.
>
>
> > + * The key is a 64-bit unsigned integer and the value is a Datum.
>
> I don't think it's a good idea to define the value type to be a datum.

A datum value is convenient to represent both a pointer and a value so
I used it to avoid defining node types for inner and leaf nodes
separately. Since a datum could be 4 bytes or 8 bytes depending it
might not be good for some platforms. But what kind of aspects do you
not like the idea of using datum?

>
>
> > +/*
> > + * As we descend a radix tree, we push the node to the stack. The stack is used
> > + * at deletion.
> > + */
> > +typedef struct radix_tree_stack_data
> > +{
> > +     radix_tree_node *node;
> > +     struct radix_tree_stack_data *parent;
> > +} radix_tree_stack_data;
> > +typedef radix_tree_stack_data *radix_tree_stack;
>
> I think it's a very bad idea for traversal to need allocations. I really want
> to eventually use this for shared structures (eventually with lock-free
> searches at least), and needing to do allocations while traversing the tree is
> a no-go for that.
>
> Particularly given that the tree currently has a fixed depth, can't you just
> allocate this on the stack once?

Yes, we can do that.

>
> > +/*
> > + * Allocate a new node with the given node kind.
> > + */
> > +static radix_tree_node *
> > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> > +{
> > +     radix_tree_node *newnode;
> > +
> > +     newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
> > +
radix_tree_node_info[kind].size);
> > +     newnode->kind = kind;
> > +
> > +     /* update the statistics */
> > +     tree->mem_used += GetMemoryChunkSpace(newnode);
> > +     tree->cnt[kind]++;
> > +
> > +     return newnode;
> > +}
>
> Why are you tracking the memory usage at this level of detail? It's *much*
> cheaper to track memory usage via the memory contexts? Since they're dedicated
> for the radix tree, that ought to be sufficient?

Indeed. I'll use MemoryContextMemAllocated instead.

>
>
> > +                                     else if (idx != n4->n.count)
> > +                                     {
> > +                                             /*
> > +                                              * the key needs to be inserted in the middle of the
> > +                                              * array, make space for the new key.
> > +                                              */
> > +                                             memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
> > +                                                             sizeof(uint8) * (n4->n.count - idx));
> > +                                             memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
> > +                                                             sizeof(radix_tree_node *) * (n4->n.count - idx));
> > +                                     }
>
> Maybe we could add a static inline helper for these memmoves? Both because
> it's repetitive (for different node types) and because the last time I looked
> gcc was generating quite bad code for this. And having to put workarounds into
> multiple places is obviously worse than having to do it in one place.

Agreed, I'll update it.

>
>
> > +/*
> > + * Insert the key with the val.
> > + *
> > + * found_p is set to true if the key already present, otherwise false, if
> > + * it's not NULL.
> > + *
> > + * XXX: do we need to support update_if_exists behavior?
> > + */
>
> Yes, I think that's needed - hence using bfm_set() instead of insert() in the
> prototype.

Agreed.

>
>
> > +void
> > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> > +{
> > +     int                     shift;
> > +     bool            replaced;
> > +     radix_tree_node *node;
> > +     radix_tree_node *parent = tree->root;
> > +
> > +     /* Empty tree, create the root */
> > +     if (!tree->root)
> > +             radix_tree_new_root(tree, key, val);
> > +
> > +     /* Extend the tree if necessary */
> > +     if (key > tree->max_val)
> > +             radix_tree_extend(tree, key);
>
> FWIW, the reason I used separate functions for these in the prototype is that
> it turns out to generate a lot better code, because it allows non-inlined
> function calls to be sibling calls - thereby avoiding the need for a dedicated
> stack frame. That's not possible once you need a palloc or such, so splitting
> off those call paths into dedicated functions is useful.

Thank you for the info. How much does using sibling call optimization
help the performance in this case? I think that these two cases are
used only a limited number of times: inserting the first key and
extending the tree.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



pgsql-hackers by date:

Previous
From: Dagfinn Ilmari Mannsåker
Date:
Subject: Re: [PATCH] Add result_types column to pg_prepared_statements view
Next
From: Masahiko Sawada
Date:
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum