Re: Invalid pointer access in logical decoding after error - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Invalid pointer access in logical decoding after error |
Date | |
Msg-id | CALDaNm2RPmNpyhiLj4WRn70izxnM7Cbmo53HUffZtg72MwZFCQ@mail.gmail.com Whole thread Raw |
In response to | Re: Invalid pointer access in logical decoding after error (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
On Fri, 26 Sept 2025 at 05:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote: > > > > > > > > > > > Hi, > > > > > > > > I encountered an invalid pointer access issue. Below are the steps to > > > > reproduce the issue: > > > ... > > > > The error occurs because entry->columns is allocated in the entry > > > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This > > > > context is a child of the PortalContext, which is cleared after an > > > > error via: AbortTransaction > > > > -> AtAbort_Portals -> > > > > MemoryContextDeleteChildren -> MemoryContextDelete -> > > > > MemoryContextDeleteOnly > > > > As a result, the memory backing entry->columns is freed, but the > > > > RelationSyncCache which resides in CacheMemoryContext and thus > > > > survives the error still holds a dangling pointer to this freed > > > > memory, causing it to pfree an invalid pointer. > > > > In the normal (positive) execution flow, pgoutput_shutdown() is called > > > > to clean up the RelationSyncCache. This happens via: > > > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But > > > > this is not called in case of an error case. To handle this case > > > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to > > > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately. > > > > Attached patch has the changes for the same. > > > > Thoughts? > > > > > > Thank you for reporting the issue and providing a fix. > > > > > > I recall that we identified this general issue with the hash table in pgoutput > > > in other threads as well [1]. The basic consensus [2] is that calling > > > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes > > > user code, increasing the risk of encountering another error within PG_CATCH. > > > This scenario could prevent execution of subsequent code to invalidate syscache > > > entries, which is problematic. > > > > Yes, let's avoid this. > > > > > I think a better fix could be to introduce a memory context reset callback(on > > > data->cachectx) and perform the actions of pgoutput_shutdown() within it. > > > > The attached v2 version patch has the changes for the same. > > We've addressed several memory-related issues in pgoutput. While most > of these issues didn't affect logical replication, they did impact > logical decoding called via SQL API. I find that these problems stem > from RelationSyncCache being defined as a file-scope static variable > and being allocated in CacheMemoryContext. I'm wondering if we could > move it to PGOutputData and create it under the logical decoding > context. This would ensure it's automatically cleaned up along with > the logical decoding context. I see one issue with keeping RelationSyncCache inside PGOutputData. Both rel_sync_cache_publication_cb and rel_sync_cache_relation_cb rely on this cache: they check its validity and update it in response to invalidation events. The problem is that these callbacks cannot be unregistered once they are registered. After pgoutput_shutdown() runs, the hash table is destroyed and later FreeDecodingContext deletes the logical decoding context. At that point, the registered callbacks may still fire, but they’ll be holding onto a freed memory. This is also noted directly in the callbacks: /* * We can get here if the plugin was used in SQL interface as the * RelationSyncCache is destroyed when the decoding finishes, but there is * no way to unregister the relcache invalidation callback. */ if (RelationSyncCache == NULL) return; So the core issue is that the callbacks may outlive the cache they reference, and we don’t have a mechanism to unregister them cleanly. Thoughts? Regards, Vignesh
pgsql-hackers by date: