Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoC+tKRwgnnL47vGOnk_C_8WFupgGo4_U=+JLTy7QACSQA@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <johncnaylorls@gmail.com>)
List pgsql-hackers
On Mon, Dec 4, 2023 at 5:21 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > Since the variable-length values support is a big deal and would be
> > related to API design I'd like to discuss the API design first.
>
> Thanks for the fine summary of the issues here.
>
> [Swapping this back in my head]
>
> > RT_VALUE_TYPE
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
> > or for variable-length value support,
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If an entry already exists, return its pointer and set "found" to
> > true. Otherwize, insert an empty value with sz bytes, return its
> > pointer, and set "found" to false.
>
> > ---
> > bool
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> > or for variable-length value support,
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
> >
> > If an entry already exists, update its value to 'value_p' and return
> > true. Otherwise set the value and return false.
>
> I'd have to double-check, but I think RT_SET is vestigial and I'm not
> sure it has any advantage over RT_GET as I've sketched it out. I'm
> pretty sure it's only there now because changing the radix tree
> regression tests is much harder than changing TID store.

Agreed.

>
> > Given variable-length value support, RT_GET() would have to do
> > repalloc() if the existing value size is not big enough for the new
> > value, but it cannot as the radix tree doesn't know the size of each
> > stored value.
>
> I think we have two choices:
>
> - the value stores the "length". The caller would need to specify a
> function to compute size from the "length" member. Note this assumes
> there is an array. I think both aspects are not great.
> - the value stores the "size". Callers that store an array (as
> PageTableEntry's do) would compute length when they need to. This
> sounds easier.

As for the second idea, do we always need to require the value to have
the "size" (e.g. int32) in the first field of its struct? If so, the
caller will be able to use only 4 bytes in embedded value cases (or
won't be able to use at all if the pointer size is 4 bytes).

>
> > Another idea is that the radix tree returns the pointer
> > to the slot and the caller updates the value accordingly.
>
> I did exactly this in v43 TidStore if I understood you correctly. If I
> misunderstood you, can you clarify?

I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates
the value as they want.

>
> > But it means
> > that the caller has to update the slot properly while considering the
> > value size (embedded vs. single-leave value), which seems not a good
> > idea.
>
> For this optimization, callers will have to know about pointer-sized
> values and treat them differently, but they don't need to know the
> details about how where they are stored.
>
> While we want to keep embedded values in the back of our minds, I
> really think the details should be postponed to a follow-up commit.

Agreed.

>
> > To deal with this problem, I think we can somewhat change RT_GET() API
> > as follow:
> >
> > RT_VALUE_TYPE
> > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If the entry already exists, replace the value with a new empty value
> > with sz bytes and set "found" to true. Otherwise, insert an empty
> > value, return its pointer, and set "found" to false.
> >
> > We probably will find a better name but I use RT_INSERT() for
> > discussion. RT_INSERT() returns an empty slot regardless of existing
> > values. It can be used to insert a new value or to replace the value
> > with a larger value.
>
> For the case we are discussing, bitmaps, updating an existing value is
> a bit tricky. We need the existing value to properly update it with
> set or unset bits. This can't work in general without a lot of work
> for the caller.

True.

>
> However, for vacuum, we have all values that we need up front. That
> gives me an idea: Something like this insert API could be optimized
> for "insert-only": If we only free values when we free the whole tree
> at the end, that's a clear use case for David Rowley's proposed "bump
> context", which would save 8 bytes per allocation and be a bit faster.
> [1] (RT_GET for varlen values would use an aset context, to allow
> repalloc, and nodes would continue to use slab).

Interesting idea and worth trying it. Do we need to protect the whole
tree as insert-only for safety? It's problematic if the user uses
mixed RT_INSERT() and RT_GET().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Emitting JSON to file using COPY TO
Next
From: Joe Conway
Date:
Subject: Re: Emitting JSON to file using COPY TO