Thread: Re: Fix crash when non-creator begins an iteration on shared radix tree
On Wed, Dec 18, 2024 at 12:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> I found that a server crashes due to a null-pointer-dereference if a
> process attached to the shared radix tree begins an iteration on it,
> because we don't create the memory context for iter_context at
> RT_ATTACH(). There is no code in the core causing this crash, since
> in parallel vacuum, the leader process always creates the shared
> radix tree and begins the iteration. However it could happen in
> external extensions. I've attached the patch to fix it and I think it
> should be backpatched to v17.

+1 in general, but I wonder if instead the iter_context should be
created within RT_BEGIN_ITERATE -- I imagine that would have less
duplication and would be as safe, but I haven't tried it. Is there
some reason not to do that?

--
John Naylor
Amazon Web Services
On Thu, Dec 19, 2024 at 1:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Dec 17, 2024 at 11:12 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > +1 in general, but I wonder if instead the iter_context should be
> > created within RT_BEGIN_ITERATE -- I imagine that would have less
> > duplication and would be as safe, but I haven't tried it. Is there
> > some reason not to do that?
>
> I agree that it has less duplication. There is no strong reason I
> didn't do that. I just didn't want to check 'if (!tree->iter_context)'
> in RT_BEGIN_ITERATE for simplicity. I've changed the patch
> accordingly.

I see what you mean. For v17, a bit of duplication is probably worth
it for simplicity, so I'd say v1 is fine there.

However, I think on master we should reconsider some aspects of memory
management more broadly:

1. The creator allocates the root of the tree in a new child context,
but an attaching process allocates it in its current context, and we
pfree it when the caller wants to detach. It seems like we could
always allocate this small struct in CurrentMemoryContext for
consistency.

2. The iter_context is separate because the creator's new context
could be a bump context which doesn't support pfree. But above we
assume we can pfree in the caller's context. Also, IIUC we only
allocate small iter objects, and it'd be unusual to need more than one
at a time per backend, so it's a bit strange to have an entire context
for that. Since we use a standard pattern of "begin; while(iter);
end;", it seems unlikely that someone will cause a leak because of a
coding mistake in iteration.

If these tiny admin structs were always, not sometimes, in the
caller's current context, I think it would be easier to reason about
because then the creator's passed context would be used only for local
memory, specifically only for leaves and the inner node child
contexts. Thoughts?

Further,

3. I was never a fan of trying to second-guess the creator's new
context and instead use slab for fixed-sized leaf allocations. If the
creator passes a bump context, we say "no, no, no, use slab -- it's
good for ya". Let's assume the caller knows what they're doing.

4. For local memory, an allocated "control object" serves no real
purpose and wastes a few cycles on every access. I'm not sure it
matters that much as a future micro-optimization, but I mention it
here because if we did start allocating outer structs in the caller's
context, embedding would also remove the need to pfree it.

--
John Naylor
Amazon Web Services
On Wed, Dec 18, 2024 at 10:32 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Dec 19, 2024 at 1:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Dec 17, 2024 at 11:12 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > +1 in general, but I wonder if instead the iter_context should be
> > > created within RT_BEGIN_ITERATE -- I imagine that would have less
> > > duplication and would be as safe, but I haven't tried it. Is there
> > > some reason not to do that?
> >
> > I agree that it has less duplication. There is no strong reason I
> > didn't do that. I just didn't want to check 'if (!tree->iter_context)'
> > in RT_BEGIN_ITERATE for simplicity. I've changed the patch
> > accordingly.
>
> I see what you mean. For v17, a bit of duplication is probably worth
> it for simplicity, so I'd say v1 is fine there.
>
> However, I think on master we should reconsider some aspects of memory
> management more broadly:
>
> 1. The creator allocates the root of the tree in a new child context,
> but an attaching process allocates it in its current context, and we
> pfree it when the caller wants to detach. It seems like we could
> always allocate this small struct in CurrentMemoryContext for
> consistency.
>
> 2. The iter_context is separate because the creator's new context
> could be a bump context which doesn't support pfree. But above we
> assume we can pfree in the caller's context. Also, IIUC we only
> allocate small iter objects, and it'd be unusual to need more than one
> at a time per backend, so it's a bit strange to have an entire context
> for that. Since we use a standard pattern of "begin; while(iter);
> end;", it seems unlikely that someone will cause a leak because of a
> coding mistake in iteration.
>
> If these tiny admin structs were always, not sometimes, in the
> caller's current context, I think it would be easier to reason about
> because then the creator's passed context would be used only for local
> memory, specifically only for leaves and the inner node child
> contexts. Thoughts?

Fair points. Given that we need only one iterator at a time per
backend, it would be simpler if the caller passes the pointer to an
iterator that is a stack variable to RT_BEGIN_ITERATE(). For example,
TidStoreBeginIterate() would be like:

    if (TidStoreIsShared(ts))
        shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
    else
        local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.local);

> Further,
>
> 3. I was never a fan of trying to second-guess the creator's new
> context and instead use slab for fixed-sized leaf allocations. If the
> creator passes a bump context, we say "no, no, no, use slab -- it's
> good for ya". Let's assume the caller knows what they're doing.

That's a valid argument, but how can a user use the slab context for
leaf allocations? If the caller passes an allocset context to
RT_CREATE(), it still makes sense to use a slab context for leaf
allocation in terms of avoiding possible space wasting.

> 4. For local memory, an allocated "control object" serves no real
> purpose and wastes a few cycles on every access. I'm not sure it
> matters that much as a future micro-optimization, but I mention it
> here because if we did start allocating outer structs in the caller's
> context, embedding would also remove the need to pfree it.

Using an allocated "control object" can simplify the code for local
and shared trees. We cannot embed the control object into
RT_RADIX_TREE in shared cases. I agree to embed the control data if we
can implement that cleanly.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 20, 2024 at 2:27 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Dec 20, 2024 at 4:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Dec 18, 2024 at 10:32 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > > 2. The iter_context is separate because the creator's new context
> > > could be a bump context which doesn't support pfree. But above we
> > > assume we can pfree in the caller's context. Also, IIUC we only
> > > allocate small iter objects, and it'd be unusual to need more than
> > > one at a time per backend, so it's a bit strange to have an entire
> > > context for that. Since we use a standard pattern of "begin;
> > > while(iter); end;", it seems unlikely that someone will cause a
> > > leak because of a coding mistake in iteration.
>
> v3-0001 allocates the iter data in the caller's context. It's a small
> patch, but still a core behavior change so proposed for master-only. I
> believe your v1 is still the least invasive fix for PG17.

I agree to use v1 for v17.

> > > If these tiny admin structs were always, not sometimes, in the
> > > caller's current context, I think it would be easier to reason
> > > about because then the creator's passed context would be used only
> > > for local memory, specifically only for leaves and the inner node
> > > child contexts.
>
> 0002 does this.
>
> > Fair points. Given that we need only one iterator at a time per
> > backend, it would be simpler if the caller passes the pointer to an
> > iterator that is a stack variable to RT_BEGIN_ITERATE(). For example,
> > TidStoreBeginIterate() would be like:
> >
> >     if (TidStoreIsShared(ts))
> >         shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
> >     else
> >         local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.local);
>
> Hard for me to tell if it'd be simpler.
>
> > > 3. I was never a fan of trying to second-guess the creator's new
> > > context and instead use slab for fixed-sized leaf allocations. If
> > > the creator passes a bump context, we say "no, no, no, use slab --
> > > it's good for ya". Let's assume the caller knows what they're
> > > doing.
> >
> > That's a valid argument but how can a user use the slab context for
> > leaf allocations?
>
> It's trivial after 0001-02: 0003 makes one test use slab as the passed
> context (only 32-bit systems would actually use it currently).

These changes make sense to me. Here are a few comments:

RT_RADIX_TREE has 'leaf_context' but it seems that we use it only for
local memory. Do we want to declare it only in the !RT_SHMEM case?

---
    /*
     * Similar to TidStoreCreateLocal() but create a shared TidStore on a
     * DSA area. The TID storage will live in the DSA area, and the memory
     * context rt_context will have only meta data of the radix tree.
     *
     * The returned object is allocated in backend-local memory.
     */

We need to update the comment about rt_context too since we no longer
use rt_context for shared tidstore.

> Also, with a bit more work we could allow a NULL context for when the
> caller has purposely arranged to use pointer-sized values. Did you see
> any of Heikki's CSN patches? There is a radix tree used as a cache in
> a context where the tree could be created and destroyed frequently.
> Something about the memory blocks seems to have tickled some bad case
> in the glibc allocator, and one less context might be good insurance:
>
> https://www.postgresql.org/message-id/718d1788-b058-40e6-bc37-8f15612b5646%40iki.fi

Will check these patches. Thanks!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com