Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Date
Msg-id CALj2ACXhM_yTBXXTCA-whZ_oA1x3DuGxyKnsROjMh8YZLG9bJw@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
List pgsql-hackers
On Fri, Jan 29, 2021 at 11:08 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Jan 29, 2021 at 10:55 AM Fujii Masao
> <masao.fujii@oss.nttdata.com> wrote:
> > On 2021/01/29 14:12, Bharath Rupireddy wrote:
> > > On Fri, Jan 29, 2021 at 10:28 AM Fujii Masao
> > > <masao.fujii@oss.nttdata.com> wrote:
> > >> On 2021/01/29 11:09, Tom Lane wrote:
> > >>> Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> > >>>> On Fri, Jan 29, 2021 at 1:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2021-01-26%2019%3A59%3A40
> > >>>>> This is a CLOBBER_CACHE_ALWAYS build, so I suspect what it's
> > >>>>> telling us is that the patch's behavior is unstable in the face
> > >>>>> of unexpected cache flushes.
> > >>>
> > >>>> Thanks a lot! It looks like the syscache invalidation messages are
> > >>>> generated too frequently with -DCLOBBER_CACHE_ALWAYS build due to
> > >>>> which pgfdw_inval_callback gets called many times in which the cached
> > >>>> entries are marked as invalid and closed if they are not used in the
> > >>>> txn. The new function postgres_fdw_get_connections outputs the
> > >>>> information of the cached connections such as name if the connection
> > >>>> is still open and their validity. Hence the output of the
> > >>>> postgres_fdw_get_connections became unstable in the buildfarm member.
> > >>>> I will further analyze making tests stable, meanwhile any suggestions
> > >>>> are welcome.
> > >>>
> > >>> I do not think you should regard this as "we need to hack the test
> > >>> to make it stable".  I think you should regard this as "this is a
> > >>> bug".  A cache flush should not cause user-visible state changes.
> > >>> In particular, the above analysis implies that you think a cache
> > >>> flush is equivalent to end-of-transaction, which it absolutely
> > >>> is not.
> > >>>
> > >>> Also, now that I've looked at pgfdw_inval_callback, it scares
> > >>> the heck out of me.  Actually disconnecting a connection during
> > >>> a cache inval callback seems quite unsafe --- what if that happens
> > >>> while we're using the connection?
> > >>
> > >> If the connection is still used in the transaction, pgfdw_inval_callback()
> > >> marks it as invalidated and doesn't close it. So I was not thinking that
> > >> this is so unsafe.
> > >>
> > >> The disconnection code in pgfdw_inval_callback() was added in commit
> > >> e3ebcca843 to fix connection leak issue, and it's back-patched. If this
> > >> change is really unsafe, we need to revert it immediately at least from back
> > >> branches because the next minor release is scheduled soon.
> > >
> > > I think we can remove disconnect_pg_server in pgfdw_inval_callback and
> > > make entries only invalidated. Anyways, those connections can get
> > > closed at the end of main txn in pgfdw_xact_callback. Thoughts?
> >
> > But this revives the connection leak issue. So isn't it better to
> > to do that after we confirm that the current code is really unsafe?
>
> IMO, connections will not leak, because the invalidated connections
> eventually will get closed in pgfdw_xact_callback at the main txn end.
>
> IIRC, when we were finding a way to close the invalidated connections
> so that they don't leaked, we had two options:
>
> 1) let those connections (whether currently being used in the xact or
> not) get marked invalidated in pgfdw_inval_callback and closed in
> pgfdw_xact_callback at the main txn end as shown below
>
>         if (PQstatus(entry->conn) != CONNECTION_OK ||
>             PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
>             entry->changing_xact_state ||
>             entry->invalidated).   ----> by adding this
>         {
>             elog(DEBUG3, "discarding connection %p", entry->conn);
>             disconnect_pg_server(entry);
>         }
>
> 2) close the unused connections right away in pgfdw_inval_callback
> instead of marking them invalidated. Mark used connections as
> invalidated in pgfdw_inval_callback and close them in
> pgfdw_xact_callback at the main txn end.
>
> We went with option (2) because we thought this would ease some burden
> on pgfdw_xact_callback closing a lot of invalid connections at once.

Also, see the original patch for the connection leak issue just does
option (1), see [1]. But in [2] and [3], we chose option (2).

I feel, we can go for option (1), with the patch attached in [1] i.e.
having have_invalid_connections whenever any connection gets invalided
so that we don't quickly exit in pgfdw_xact_callback and the
invalidated connections get closed properly. Thoughts?

static void
pgfdw_xact_callback(XactEvent event, void *arg)
{
    HASH_SEQ_STATUS scan;
    ConnCacheEntry *entry;

    /* Quick exit if no connections were touched in this transaction. */
    if (!xact_got_connection)
        return;

[1] https://www.postgresql.org/message-id/CALj2ACVNcGH_6qLY-4_tXz8JLvA%2B4yeBThRfxMz7Oxbk1aHcpQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/f57dd9c3-0664-5f4c-41f0-0713047ae7b7%40oss.nttdata.com
[3] https://www.postgresql.org/message-id/CALj2ACVNjV1%2B72f3nVCngC7RsGSiGXZQ2mAzYx_Dij7oJpV8iA%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Next
From: Amit Kapila
Date:
Subject: Re: Why does creating logical replication subscriptions require superuser?