Re: Invalid pointer access in logical decoding after error - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Invalid pointer access in logical decoding after error |
Date | |
Msg-id | CAD21AoAFHknoV0XCrcQerPys4PQijyjSSAGOq5yMhZ+9pHGc4g@mail.gmail.com Whole thread Raw |
In response to | Re: Invalid pointer access in logical decoding after error (vignesh C <vignesh21@gmail.com>) |
Responses |
Re: Invalid pointer access in logical decoding after error
|
List | pgsql-hackers |
On Mon, Sep 29, 2025 at 1:52 AM vignesh C <vignesh21@gmail.com> wrote: > > On Fri, 26 Sept 2025 at 05:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote: > > > > > > > > > > > > > > Hi, > > > > > > > > > > I encountered an invalid pointer access issue. Below are the steps to > > > > > reproduce the issue: > > > > ... > > > > > The error occurs because entry->columns is allocated in the entry > > > > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This > > > > > context is a child of the PortalContext, which is cleared after an > > > > > error via: AbortTransaction > > > > > -> AtAbort_Portals -> > > > > > MemoryContextDeleteChildren -> MemoryContextDelete -> > > > > > MemoryContextDeleteOnly > > > > > As a result, the memory backing entry->columns is freed, but the > > > > > RelationSyncCache which resides in CacheMemoryContext and thus > > > > > survives the error still holds a dangling pointer to this freed > > > > > memory, causing it to pfree an invalid pointer. > > > > > In the normal (positive) execution flow, pgoutput_shutdown() is called > > > > > to clean up the RelationSyncCache. This happens via: > > > > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But > > > > > this is not called in case of an error case. To handle this case > > > > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to > > > > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately. > > > > > Attached patch has the changes for the same. > > > > > Thoughts? > > > > > > > > Thank you for reporting the issue and providing a fix. > > > > > > > > I recall that we identified this general issue with the hash table in pgoutput > > > > in other threads as well [1]. The basic consensus [2] is that calling > > > > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes > > > > user code, increasing the risk of encountering another error within PG_CATCH. > > > > This scenario could prevent execution of subsequent code to invalidate syscache > > > > entries, which is problematic. > > > > > > Yes, let's avoid this. > > > > > > > I think a better fix could be to introduce a memory context reset callback(on > > > > data->cachectx) and perform the actions of pgoutput_shutdown() within it. > > > > > > The attached v2 version patch has the changes for the same. > > > > We've addressed several memory-related issues in pgoutput. While most > > of these issues didn't affect logical replication, they did impact > > logical decoding called via SQL API. I find that these problems stem > > from RelationSyncCache being defined as a file-scope static variable > > and being allocated in CacheMemoryContext. I'm wondering if we could > > move it to PGOutputData and create it under the logical decoding > > context. This would ensure it's automatically cleaned up along with > > the logical decoding context. > > I see one issue with keeping RelationSyncCache inside PGOutputData. > Both rel_sync_cache_publication_cb and rel_sync_cache_relation_cb rely > on this cache: they check its validity and update it in response to > invalidation events. The problem is that these callbacks cannot be > unregistered once they are registered. After pgoutput_shutdown() runs, > the hash table is destroyed and later FreeDecodingContext deletes the > logical decoding context. At that point, the registered callbacks may > still fire, but they’ll be holding onto a freed memory. > > This is also noted directly in the callbacks: > /* > * We can get here if the plugin was used in SQL interface as the > * RelationSyncCache is destroyed when the decoding finishes, but there is > * no way to unregister the relcache invalidation callback. > */ > if (RelationSyncCache == NULL) > return; > > So the core issue is that the callbacks may outlive the cache they > reference, and we don’t have a mechanism to unregister them cleanly. > > Thoughts? I agree with your analysis. It seems there is no convenient way to move RelationSyncCache inside PGOutputData. So let's proceed with the proposed approach. I've done some minor changes to your v2 patch and updated the commit message. IIUC this patch needs to be backpatched to v15. Please review the attached patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: