Thread: Re: Fix crash when non-creator begins an iteration on shared radix tree
On Wed, Dec 18, 2024 at 12:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> I found that a server crashes due to a null-pointer-dereference if a
> process attached to the shared radix tree begins an iteration on it,
> because we don't create the memory context for iter_context at
> RT_ATTACH(). There is no code in the core causing this crash, since
> in parallel vacuum, the leader process always creates the shared
> radix tree and begins the iteration. However it could happen in
> external extensions. I've attached the patch to fix it and I think it
> should be backpatched to v17.

+1 in general, but I wonder if instead the iter_context should be
created within RT_BEGIN_ITERATE -- I imagine that would have less
duplication and would be as safe, but I haven't tried it. Is there
some reason not to do that?

--
John Naylor
Amazon Web Services
On Thu, Dec 19, 2024 at 1:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Dec 17, 2024 at 11:12 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > +1 in general, but I wonder if instead the iter_context should be
> > created within RT_BEGIN_ITERATE -- I imagine that would have less
> > duplication and would be as safe, but I haven't tried it. Is there
> > some reason not to do that?
>
> I agree that it has less duplication. There is no strong reason I
> didn't do that. I just didn't want to check 'if (!tree->iter_context)'
> in RT_BEGIN_ITERATE for simplicity. I've changed the patch
> accordingly.

I see what you mean. For v17, a bit of duplication is probably worth
it for simplicity, so I'd say v1 is fine there.

However, I think on master we should reconsider some aspects of memory
management more broadly:

1. The creator allocates the root of the tree in a new child context,
but an attaching process allocates it in its current context, and we
pfree it when the caller wants to detach. It seems like we could
always allocate this small struct in CurrentMemoryContext for
consistency.

2. The iter_context is separate because the creator's new context
could be a bump context which doesn't support pfree. But above we
assume we can pfree in the caller's context. Also, IIUC we only
allocate small iter objects, and it'd be unusual to need more than one
at a time per backend, so it's a bit strange to have an entire context
for that. Since we use a standard pattern of "begin; while(iter);
end;", it seems unlikely that someone will cause a leak because of a
coding mistake in iteration.

If these tiny admin structs were always, not sometimes, in the
caller's current context, I think it would be easier to reason about
because then the creator's passed context would be used only for local
memory, specifically only for leaves and the inner node child
contexts. Thoughts?

Further,

3. I was never a fan of trying to second-guess the creator's new
context and instead use slab for fixed-sized leaf allocations. If the
creator passes a bump context, we say "no, no, no, use slab -- it's
good for ya". Let's assume the caller knows what they're doing.

4. For local memory, an allocated "control object" serves no real
purpose and wastes a few cycles on every access. I'm not sure it
matters that much as a future micro-optimization, but I mention it
here because if we did start allocating outer structs in the caller's
context, embedding would also remove the need to pfree it.

--
John Naylor
Amazon Web Services
On Wed, Dec 18, 2024 at 10:32 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Dec 19, 2024 at 1:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Dec 17, 2024 at 11:12 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > +1 in general, but I wonder if instead the iter_context should be
> > > created within RT_BEGIN_ITERATE -- I imagine that would have less
> > > duplication and would be as safe, but I haven't tried it. Is there
> > > some reason not to do that?
> >
> > I agree that it has less duplication. There is no strong reason I
> > didn't do that. I just didn't want to check 'if (!tree->iter_context)'
> > in RT_BEGIN_ITERATE for simplicity. I've changed the patch
> > accordingly.
>
> I see what you mean. For v17, a bit of duplication is probably worth
> it for simplicity, so I'd say v1 is fine there.
>
> However, I think on master we should reconsider some aspects of memory
> management more broadly:
>
> 1. The creator allocates the root of the tree in a new child context,
> but an attaching process allocates it in its current context, and we
> pfree it when the caller wants to detach. It seems like we could
> always allocate this small struct in CurrentMemoryContext for
> consistency.
>
> 2. The iter_context is separate because the creator's new context
> could be a bump context which doesn't support pfree. But above we
> assume we can pfree in the caller's context. Also, IIUC we only
> allocate small iter objects, and it'd be unusual to need more than one
> at a time per backend, so it's a bit strange to have an entire context
> for that. Since we use a standard pattern of "begin; while(iter);
> end;", it seems unlikely that someone will cause a leak because of a
> coding mistake in iteration.
>
> If these tiny admin structs were always, not sometimes, in the
> caller's current context, I think it would be easier to reason about
> because then the creator's passed context would be used only for local
> memory, specifically only for leaves and the inner node child
> contexts. Thoughts?

Fair points. Given that we need only one iterator at a time per
backend, it would be simpler if the caller passes the pointer to an
iterator that is a stack variable to RT_BEGIN_ITERATE(). For example,
TidStoreBeginIterate() would be like:

    if (TidStoreIsShared(ts))
        shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
    else
        local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.local);

> Further,
>
> 3. I was never a fan of trying to second-guess the creator's new
> context and instead use slab for fixed-sized leaf allocations. If the
> creator passes a bump context, we say "no, no, no, use slab -- it's
> good for ya". Let's assume the caller knows what they're doing.

That's a valid argument, but how can a user use the slab context for
leaf allocations? If the caller passes an allocset context to
RT_CREATE(), it still makes sense to use a slab context for leaf
allocation in terms of avoiding possible space wasting.

> 4. For local memory, an allocated "control object" serves no real
> purpose and wastes a few cycles on every access. I'm not sure it
> matters that much as a future micro-optimization, but I mention it
> here because if we did start allocating outer structs in the caller's
> context, embedding would also remove the need to pfree it.

Using an allocated "control object" can simplify the code for local
and shared trees. We cannot embed the control object into
RT_RADIX_TREE in shared cases. I agree to embed the control data if we
can implement that cleanly.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 20, 2024 at 2:27 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Dec 20, 2024 at 4:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Dec 18, 2024 at 10:32 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > 2. The iter_context is separate because the creator's new context
> > > could be a bump context which doesn't support pfree. But above we
> > > assume we can pfree in the caller's context. Also, IIUC we only
> > > allocate small iter objects, and it'd be unusual to need more than
> > > one at a time per backend, so it's a bit strange to have an entire
> > > context for that. Since we use a standard pattern of "begin;
> > > while(iter); end;", it seems unlikely that someone will cause a
> > > leak because of a coding mistake in iteration.
>
> v3-0001 allocates the iter data in the caller's context. It's a small
> patch, but still a core behavior change so proposed for master-only. I
> believe your v1 is still the least invasive fix for PG17.

I agree to use v1 for v17.

> > > If these tiny admin structs were always, not sometimes, in the
> > > caller's current context, I think it would be easier to reason
> > > about because then the creator's passed context would be used only
> > > for local memory, specifically only for leaves and the inner node
> > > child contexts.
>
> 0002 does this.
>
> > Fair points. Given that we need only one iterator at a time per
> > backend, it would be simpler if the caller passes the pointer to an
> > iterator that is a stack variable to RT_BEGIN_ITERATE(). For example,
> > TidStoreBeginIterate() would be like:
> >
> >     if (TidStoreIsShared(ts))
> >         shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
> >     else
> >         local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.local);
>
> Hard for me to tell if it'd be simpler.
>
> > > 3. I was never a fan of trying to second-guess the creator's new
> > > context and instead use slab for fixed-sized leaf allocations. If
> > > the creator passes a bump context, we say "no, no, no, use slab --
> > > it's good for ya". Let's assume the caller knows what they're
> > > doing.
> >
> > That's a valid argument but how can a user use the slab context for
> > leaf allocations?
>
> It's trivial after 0001-02: 0003 makes one test use slab as the passed
> context (only 32-bit systems would actually use it currently).

These changes make sense to me. Here are a few comments:

RT_RADIX_TREE has 'leaf_context' but it seems that we use it only for
local memory. Do we want to declare it only in the !RT_SHMEM case?

---
    /*
     * Similar to TidStoreCreateLocal() but create a shared TidStore on a
     * DSA area. The TID storage will live in the DSA area, and the memory
     * context rt_context will have only meta data of the radix tree.
     *
     * The returned object is allocated in backend-local memory.
     */

We need to update the comment about rt_context too since we no longer
use rt_context for shared tidstore.

> Also, with a bit more work we could allow a NULL context for when the
> caller has purposely arranged to use pointer-sized values. Did you see
> any of Heikki's CSN patches? There is a radix tree used as a cache in
> a context where the tree could be created and destroyed frequently.
> Something about the memory blocks seems to have tickled some bad case
> in the glibc allocator, and one less context might be good insurance:
>
> https://www.postgresql.org/message-id/718d1788-b058-40e6-bc37-8f15612b5646%40iki.fi

Will check these patches. Thanks!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com