Home > mailing lists

Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From	Masahiko Sawada
Subject	Re: [PoC] Improve dead tuple storage for lazy vacuum
Date	July 5, 2022 07:33:29
Msg-id	CAD21AoAvyKUqMMHLt6A=g4GOBu9SNOprMP5nZumFP40HUJgnxA@mail.gmail.com Whole thread
In response to	Re: [PoC] Improve dead tuple storage for lazy vacuum (Andres Freund <andres@anarazel.de>)
Responses	Re: [PoC] Improve dead tuple storage for lazy vacuum
List	pgsql-hackers

Tree view

On Tue, Jul 5, 2022 at 7:00 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
> > In both test cases, There is not much difference between using AVX2
> > and SSE2. The more mode types, the more time it takes for loading the
> > data (see sse2_4_16_32_128_256).
>
> Yea, at some point the compiler starts using a jump table instead of branches,
> and that turns out to be a good bit more expensive. And even with branches, it
> obviously adds hard to predict branches. IIRC I fought a bit with the compiler
> to avoid some of that cost, it's possible that got "lost" in Sawada-san's
> patch.
>
>
> Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node
> one is not unimportant until we have path compression.

I wanted to start with a smaller number of node types for simplicity.
16 node type has been added to v4 patch I submitted[1]. I think it's
trade-off between better memory and the overhead of growing (and
shrinking) the node type. I'm going to add more node types once we
turn out based on the benchmark that it's beneficial.

>
> Right now the node struct sizes are:
> 4 - 48 bytes
> 32 - 296 bytes
> 128 - 1304 bytes
> 256 - 2088 bytes
>
> I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
> bytes, but needing that separate isset array somehow is sad :/. I wonder if a
> smaller "free index" would do the trick? Point to the element + 1 where we
> searched last and start a plain loop there. Particularly in an insert-only
> workload that'll always work, and in other cases it'll still often work I
> think.

radix_tree_node_128->isset is used to distinguish between null-pointer
in inner nodes and 0 in leaf nodes. So I guess we can have a flag to
indicate a leaf or an inner so that we can interpret (Datum) 0 as
either null-pointer or 0. Or if we define different data types for
inner and leaf nodes probably we don't need it.


> One thing I was wondering about is trying to choose node types in
> roughly-power-of-two struct sizes. It's pretty easy to end up with significant
> fragmentation in the slabs right now when inserting as you go, because some of
> the smaller node types will be freed but not enough to actually free blocks of
> memory. If we instead have ~power-of-two sizes we could just use a single slab
> of the max size, and carve out the smaller node types out of that largest
> allocation.

You meant to manage memory allocation (and free) for smaller node
types by ourselves?

How about using different block size for different node types?

>
> Btw, that fragmentation is another reason why I think it's better to track
> memory usage via memory contexts, rather than doing so based on
> GetMemoryChunkSpace().

Agreed.

>
>
> > > Ideally, node16 and node32 would have the same code with a different
> > > loop count (1 or 2). More generally, there is too much duplication of
> > > code (noted by Andres in his PoC), and there are many variable names
> > > with the node size embedded. This is a bit tricky to make more
> > > general, so we don't need to try it yet, but ideally we would have
> > > something similar to:
> > >
> > > switch (node->kind) // todo: inspect tagged pointer
> > > {
> > >   case RADIX_TREE_NODE_KIND_4:
> > >        idx = node_search_eq(node, chunk, 4);
> > >        do_action(node, idx, 4, ...);
> > >        break;
> > >   case RADIX_TREE_NODE_KIND_32:
> > >        idx = node_search_eq(node, chunk, 32);
> > >        do_action(node, idx, 32, ...);
> > >   ...
> > > }
>
> FWIW, that should be doable with an inline function, if you pass it the memory
> to the "array" rather than the node directly. Not so sure it's a good idea to
> do dispatch between node types / search methods inside the helper, as you
> suggest below:
>
>
> > > static pg_alwaysinline void
> > > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
> > > {
> > > if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
> > >   // do simple loop with (node_simple *) node;
> > > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
> > >   // do vectorized loop where available with (node_vec *) node;
> > > ...
> > > }

Yeah, It's worth trying at some point.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

pgsql-hackers by date:

From: Masahiko Sawada
Date: 05 July 2022, 07:33:17
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Amit Langote
Date: 05 July 2022, 07:36:08
Subject: Re: doc phrase: "inheritance child"

Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

Previous

Next