Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id 20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
Hi,

On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> new file mode 100644
> index 0000000000..bf87f932fd
> --- /dev/null
> +++ b/src/backend/lib/radixtree.c
> @@ -0,0 +1,1763 @@
> +/*-------------------------------------------------------------------------
> + *
> + * radixtree.c
> + *        Implementation for adaptive radix tree.
> + *
> + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
> + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
> + * Neumann, 2013.
> + *
> + * There are some differences from the proposed implementation.  For instance,
> + * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
> + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
> + * Also, there is no support for path compression and lazy path expansion. The
> + * radix tree supports fixed length of the key so we don't expect the tree level
> + * wouldn't be high.

I think we're going to need path compression at some point, fwiw. I'd bet on
it being beneficial even for the tid case.


> + * The key is a 64-bit unsigned integer and the value is a Datum.

I don't think it's a good idea to define the value type to be a datum.


> +/*
> + * As we descend a radix tree, we push the node to the stack. The stack is used
> + * at deletion.
> + */
> +typedef struct radix_tree_stack_data
> +{
> +    radix_tree_node *node;
> +    struct radix_tree_stack_data *parent;
> +} radix_tree_stack_data;
> +typedef radix_tree_stack_data *radix_tree_stack;

I think it's a very bad idea for traversal to need allocations. I really want
to eventually use this for shared structures (eventually with lock-free
searches at least), and needing to do allocations while traversing the tree is
a no-go for that.

Particularly given that the tree currently has a fixed depth, can't you just
allocate this on the stack once?

> +/*
> + * Allocate a new node with the given node kind.
> + */
> +static radix_tree_node *
> +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> +{
> +    radix_tree_node *newnode;
> +
> +    newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
> +                                                         radix_tree_node_info[kind].size);
> +    newnode->kind = kind;
> +
> +    /* update the statistics */
> +    tree->mem_used += GetMemoryChunkSpace(newnode);
> +    tree->cnt[kind]++;
> +
> +    return newnode;
> +}

Why are you tracking the memory usage at this level of detail? It's *much*
cheaper to track memory usage via the memory contexts? Since they're dedicated
for the radix tree, that ought to be sufficient?


> +                    else if (idx != n4->n.count)
> +                    {
> +                        /*
> +                         * the key needs to be inserted in the middle of the
> +                         * array, make space for the new key.
> +                         */
> +                        memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
> +                                sizeof(uint8) * (n4->n.count - idx));
> +                        memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
> +                                sizeof(radix_tree_node *) * (n4->n.count - idx));
> +                    }

Maybe we could add a static inline helper for these memmoves? Both because
it's repetitive (for different node types) and because the last time I looked
gcc was generating quite bad code for this. And having to put workarounds into
multiple places is obviously worse than having to do it in one place.


> +/*
> + * Insert the key with the val.
> + *
> + * found_p is set to true if the key already present, otherwise false, if
> + * it's not NULL.
> + *
> + * XXX: do we need to support update_if_exists behavior?
> + */

Yes, I think that's needed - hence using bfm_set() instead of insert() in the
prototype.


> +void
> +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> +{
> +    int            shift;
> +    bool        replaced;
> +    radix_tree_node *node;
> +    radix_tree_node *parent = tree->root;
> +
> +    /* Empty tree, create the root */
> +    if (!tree->root)
> +        radix_tree_new_root(tree, key, val);
> +
> +    /* Extend the tree if necessary */
> +    if (key > tree->max_val)
> +        radix_tree_extend(tree, key);

FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Next
From: Justin Pryzby
Date:
Subject: Re: doc: BRIN indexes and autosummarize