Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoA_VjvG6LDYvJyuezNPtH12bx0My5Ws88o1Gu4h9qX5GQ@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > It looks no problem in terms of vacuum integration, although I've not
> > fully tested yet. TID store uses the radix tree as the main storage,
> > and with the template radix tree, the data types for shared and
> > non-shared will be different. TID store can have an union for the
> > radix tree and the structure would be like follows:
>
> >     /* Storage for Tids */
> >     union tree
> >     {
> >         local_radix_tree    *local;
> >         shared_radix_tree   *shared;
> >     };
>
> We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We
wouldhave to be more careful of things like the 32-bit crash from a few weeks ago. 

One idea to have a common data type without unused fields is to use
radix_tree a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the flag is_shared in radix_tree. For
instance we have like (based on non-template version),

struct radix_tree
{
    bool    is_shared;
    MemoryContext context;
};

typedef struct rt_shared
{
    rt_handle   handle;
    uint32      magic;

    /* Root node */
   dsa_pointer  root;

    uint64      max_val;
    uint64      num_keys;

    /* need a lwlock */

    /* statistics */
#ifdef RT_DEBUG
    int32       cnt[RT_SIZE_CLASS_COUNT];
#endif
} rt_shared;

struct radix_tree_shared
{
    radix_tree rt;

    rt_shared *shared;
    dsa_area *area;
} radix_tree_shared;

struct radix_tree_local
{
    radix_tree rt;

    uint64  max_val;
    uint64  num_keys;

    rt_node *root;

    /* used only when the radix tree is private */
    MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
    MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];

    /* statistics */
#ifdef RT_DEBUG
    int32       cnt[RT_SIZE_CLASS_COUNT];
#endif
} radix_tree_local;

>
> > In the functions of TID store, we need to call either local or shared
> > radix tree functions depending on whether TID store is shared or not.
> > We need if-branch for each key-value pair insertion, but I think it
> > would not be a big performance problem in TID store use cases, since
> > vacuum is an I/O intensive operation in many cases.
>
> Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and
fattercode paths. 
>
> > Overall, I think
> > there is no problem and I'll investigate it in depth.
>
> Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach
forshared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to
easeadding variable-length keys if someone finds a need. 

I agree to keep this as a template. From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different values types.

>
> > Apart from that, I've been considering the lock support for shared
> > radix tree. As we discussed before, the current usage (i.e, only
> > parallel index vacuum) doesn't require locking support at all, so it
> > would be enough to have a single lock for simplicity.
>
> Right, that should be enough for PG16.
>
> > If we want to
> > use the shared radix tree for other use cases such as the parallel
> > heap vacuum or the replacement of the hash table for shared buffers,
> > we would need better lock support.
>
> For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays.
Highlyconcurrent applications will need additional work, of course. 
>
> > For example, if we want to support
> > Optimistic Lock Coupling[1],
>
> Interesting, from the same authors!

+1

>
> > we would need to change not only the node
> > structure but also the logic. Which probably leads to widen the gap
> > between the code for non-shared and shared radix tree. In this case,
> > once we have a better radix tree optimized for shared case, perhaps we
> > can replace the templated shared radix tree with it. I'd like to hear
> > your opinion on this line.
>
> I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the
localimplementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes. 

>
> > > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type
--it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take
uponly 32 bytes with a fanout of 3, by eliminating padding. 
> >
> > Oh, I didn't notice that. The chunk field was originally used when
> > redirecting the child pointer in the parent node from old to new
> > (grown) node. When redirecting the pointer, since the corresponding
> > chunk surely exists on the parent we can skip existence checks.
> > Currently we use RT_NODE_UPDATE_INNER() for that (see
> > RT_REPLACE_NODE()) but having a dedicated function to update the
> > existing chunk and child pointer might improve the performance. Or
> > reducing the node size by getting rid of the chunk field might be
> > better.
>
> I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift"
fromL1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in
growinga node. Expressions like "if (idx < 0) return false;" return to an asserts-only variable, so in production
builds,I would hope that branch gets elided (I haven't checked). 
>
> I'm quite keen on making the smallest node padding-free, (since we don't yet have path compression or lazy path
expansion),and this seems the way to get there. 

Okay, let's get rid of that in the v18.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Brar Piening
Date:
Subject: Re: pgsql: Doc: add XML ID attributes to and tags.
Next
From: Michael Paquier
Date:
Subject: Re: Add a new pg_walinspect function to extract FPIs from WAL records