Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Date | |
Msg-id | CAD21AoC8A0bUU+q7-k9hF7Dop3HuOUZR0x2X853oApEWErxU9A@mail.gmail.com Whole thread Raw |
In response to | Re: [PoC] Improve dead tuple storage for lazy vacuum (Andres Freund <andres@anarazel.de>) |
Responses |
Re: [PoC] Improve dead tuple storage for lazy vacuum
|
List | pgsql-hackers |
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote: > > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c > > new file mode 100644 > > index 0000000000..bf87f932fd > > --- /dev/null > > +++ b/src/backend/lib/radixtree.c > > @@ -0,0 +1,1763 @@ > > +/*------------------------------------------------------------------------- > > + * > > + * radixtree.c > > + * Implementation for adaptive radix tree. > > + * > > + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful > > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas > > + * Neumann, 2013. > > + * > > + * There are some differences from the proposed implementation. For instance, > > + * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit > > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper. > > + * Also, there is no support for path compression and lazy path expansion. The > > + * radix tree supports fixed length of the key so we don't expect the tree level > > + * wouldn't be high. > > I think we're going to need path compression at some point, fwiw. I'd bet on > it being beneficial even for the tid case. > > > > + * The key is a 64-bit unsigned integer and the value is a Datum. > > I don't think it's a good idea to define the value type to be a datum. A datum value is convenient to represent both a pointer and a value so I used it to avoid defining node types for inner and leaf nodes separately. Since a datum could be 4 bytes or 8 bytes depending it might not be good for some platforms. But what kind of aspects do you not like the idea of using datum? > > > > +/* > > + * As we descend a radix tree, we push the node to the stack. The stack is used > > + * at deletion. > > + */ > > +typedef struct radix_tree_stack_data > > +{ > > + radix_tree_node *node; > > + struct radix_tree_stack_data *parent; > > +} radix_tree_stack_data; > > +typedef radix_tree_stack_data *radix_tree_stack; > > I think it's a very bad idea for traversal to need allocations. I really want > to eventually use this for shared structures (eventually with lock-free > searches at least), and needing to do allocations while traversing the tree is > a no-go for that. > > Particularly given that the tree currently has a fixed depth, can't you just > allocate this on the stack once? Yes, we can do that. > > > +/* > > + * Allocate a new node with the given node kind. > > + */ > > +static radix_tree_node * > > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind) > > +{ > > + radix_tree_node *newnode; > > + > > + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind], > > + radix_tree_node_info[kind].size); > > + newnode->kind = kind; > > + > > + /* update the statistics */ > > + tree->mem_used += GetMemoryChunkSpace(newnode); > > + tree->cnt[kind]++; > > + > > + return newnode; > > +} > > Why are you tracking the memory usage at this level of detail? It's *much* > cheaper to track memory usage via the memory contexts? Since they're dedicated > for the radix tree, that ought to be sufficient? Indeed. I'll use MemoryContextMemAllocated instead. > > > > + else if (idx != n4->n.count) > > + { > > + /* > > + * the key needs to be inserted in the middle of the > > + * array, make space for the new key. > > + */ > > + memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]), > > + sizeof(uint8) * (n4->n.count - idx)); > > + memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]), > > + sizeof(radix_tree_node *) * (n4->n.count - idx)); > > + } > > Maybe we could add a static inline helper for these memmoves? Both because > it's repetitive (for different node types) and because the last time I looked > gcc was generating quite bad code for this. And having to put workarounds into > multiple places is obviously worse than having to do it in one place. Agreed, I'll update it. > > > > +/* > > + * Insert the key with the val. > > + * > > + * found_p is set to true if the key already present, otherwise false, if > > + * it's not NULL. > > + * > > + * XXX: do we need to support update_if_exists behavior? > > + */ > > Yes, I think that's needed - hence using bfm_set() instead of insert() in the > prototype. Agreed. > > > > +void > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) > > +{ > > + int shift; > > + bool replaced; > > + radix_tree_node *node; > > + radix_tree_node *parent = tree->root; > > + > > + /* Empty tree, create the root */ > > + if (!tree->root) > > + radix_tree_new_root(tree, key, val); > > + > > + /* Extend the tree if necessary */ > > + if (key > tree->max_val) > > + radix_tree_extend(tree, key); > > FWIW, the reason I used separate functions for these in the prototype is that > it turns out to generate a lot better code, because it allows non-inlined > function calls to be sibling calls - thereby avoiding the need for a dedicated > stack frame. That's not possible once you need a palloc or such, so splitting > off those call paths into dedicated functions is useful. Thank you for the info. How much does using sibling call optimization help the performance in this case? I think that these two cases are used only a limited number of times: inserting the first key and extending the tree. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
pgsql-hackers by date: