RE: memory leak in logical WAL sender with pgoutput's cachectx - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: memory leak in logical WAL sender with pgoutput's cachectx
Date
Msg-id OS0PR01MB5716C4A6AEC69BE733820AA49434A@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: memory leak in logical WAL sender with pgoutput's cachectx  (Xuneng Zhou <xunengzhou@gmail.com>)
List pgsql-hackers
On Friday, August 15, 2025 10:59 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Thanks for your clarification!
> 
> On Fri, Aug 15, 2025 at 10:10 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Thursday, August 14, 2025 8:49 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > Dear Xuneng,
> > >
> > > > Is it safe to free the substructure from within
> rel_sync_cache_relation_cb()?
> > >
> > > You referred the comment in rel_sync_cache_relation_cb() right? I
> > > understood like that we must not access to any *system caches*, from
> > > the comment. Here we do not re-build caches so that we do not access
> > > to the syscaches - it is permitted.
> >
> > We should also avoid freeing the cache within the callback function,
> > as entries might still be accessed after the callback is invoked. This
> > callback can be triggered during the execution of pgoutput_change,
> > which poses a risk of concurrent access issues. For reference, see
> > commit 7f481b8, which addresses a similar issue. Ideally, cache
> > release should occur within the normal code execution path, such as within
> pgoutput_change() or get_rel_sync_entry().
> 
> I am still getting familiar with this module, so I cannot make a sound
> confirmation for this yet. I'll take a look at the commit mentioned by Zhijie and
> trace the callback paths.
> 
> > >
> > > > I’ also interested in the reasoning behind setting
> > > > NINVALIDATION_THRESHOLD to 100.
> > >
> > > This is the debatable point of this implementation. I set to 100
> > > because it was sufficient with the provided workload, but this may cause
> the replication lag.
> > > We may have to consider a benchmark workload, measure data, and
> > > consider the appropriate value. 100 is just an initial point.
> >
> > I think it's worthwhile to test the performance impact of this
> > approach. If the table is altered but not dropped, removing the
> > entries could introduce additional overhead. Although this overhead
> > might be negligible if within an acceptable threshold, it still
> > warrants consideration. Additionally, if the number of invalidations
> > is significant, the cleanup process may take a considerable amount of time,
> potentially resulting in performance degradation.
> > In such cases, it might be beneficial to consider destroying the hash
> > table and creating a new one. Therefore, analyzing these scenarios is
> essential.
> 
> Intuitively, a heuristic threshold is straightforward but also can be hard to get it
> "right" since the workload varies.
> Designing and writing tests to reflect the real-world use cases well may not be
> that easy as well. I'm wondering whether it's beneficial to brainstorm other
> potential alternatives before settling on the current approach and start
> benchmarking and performance tuning.
> 
> Sequential cleaning could be costful in certain cases. I don't know much about
> RelationSyncCache's life cycle, but it seems that we should be cautious on
> shortening it without causing subtle issues.

I don't have a great idea that can avoid introducing overhead in all cases.

An alternative approach I considered before is deleting the hash entry upon
invalidation if the entry is not in use. This would require adding an in_use
flag in RelationSyncEntry, setting it to true when get_rel_sync_entry is called,
and marking it as false once the entry is no longer in use. This approach is
inspired from the relcache invalidation. I am slightly not sure if it's worth
the effort, but just share this idea with a POC patch for reference, which has
been modified based on Kuroda-San's version.

Best Regards,
Hou zj


Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Compilation issues for HASH_STATISTICS and HASH_DEBUG options
Next
From: Thomas Munro
Date:
Subject: Re: Potential deadlock in pgaio_io_wait()