Re: client_connection_check_interval default value - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: client_connection_check_interval default value
Date
Msg-id CAHGQGwHZUmg+r4kMcPYt_Z-txxVX+CJJhfra+qemxKXvAxYbpw@mail.gmail.com
In response to Re: client_connection_check_interval default value  (Jeremy Schneider <schneider@ardentperf.com>)
Responses Re: client_connection_check_interval default value
List pgsql-hackers
On Fri, Feb 6, 2026 at 8:05 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
>
> One interesting thing to me - it seems like all of the past mail
> threads were focused on a situation different from mine. Lots of
> discussion about freeing resources like CPU.
>
> In the outage I saw, the system was idle and we completely ran out of
> max_connections because all sessions were waiting on a row lock.
>
> Importantly, the app was closing these conns but we had sockets stacking
> up on the server in CLOSE-WAIT state - and postgres simply never
> cleaned them up until we had an outage. The processes were completely
> idle waiting for a row lock that was not going to be released.
>
> Impact could have been isolated to sessions hitting that row (with this
> GUC), but it escalated to a system outage. It's pretty simple to
> reproduce this:
> https://github.com/ardentperf/pg-idle-test/tree/main/conn_exhaustion
>
>
> On Thu, 5 Feb 2026 09:26:34 -0800
> Jacob Champion <jacob.champion@enterprisedb.com> wrote:
>
> > On Wed, Feb 4, 2026 at 9:30 PM Jeremy Schneider
> > <schneider@ardentperf.com> wrote:
> > > While a fix has been merged in pgx for the most direct root cause of
> > > the incident I saw, this setting just seems like a good behavior to
> > > make Postgres more robust in general.
> >
> > At the risk of making perfect the enemy of better, the protocol-level
> > heartbeat mentioned in the original thread [1] would cover more use
> > cases, which might give it a better chance of eventually becoming
> > default behavior. It might also be a lot of work, though.
>
> It seems like a fair bit of discussion is around OS coverage - even
> Thomas' message there references keepalive working as expected on
> Linux. Tom objects in 2023 that "the default behavior would then be
> platform-dependent and that's a documentation problem we could do
> without."
>
> But it's been five years - has there been further work on implementing
> a postgres-level heartbeat? And I see other places in the docs where we
> note platform differences, is it really such a big problem to change
> the default here?
>
>
> On Thu, 5 Feb 2026 10:00:29 -0500
> Greg Sabino Mullane <htamfids@gmail.com> wrote:
>
> > I'm a weak -1 on this. Certainly not 2s! That's a lot of context
> > switching for a busy system for no real reason. Also see this past
> > discussion:
>
> In the other thread I see larger perf concerns with some early
> implementations before they refactored the patch? Konstantin's message
> on 2019-08-02 said he didn't see much difference, and the value of the
> timeout didn't seem to matter, and if anything the marginal effect was
> simply from the presence of any timer (same effect as setting
> statement_timeout) - and later on the thread it seems like Thomas also
> saw minimal performance concern here.
>
> I did see a real system outage that could have been prevented by an
> appropriate default value here, since I didn't yet know to change it.

I'm not sure that client_connection_check_interval needs to be enabled
by default. However, if we do agree to change the default, I think we
should first address a related issue: with log_lock_waits enabled by
default, setting client_connection_check_interval to 2s would cause
"still waiting" messages to be logged every 2 seconds while a backend
waits on a lock. That could produce a lot of noisy logging under default
settings.

The issue is that backends blocked in ProcSleep() are woken up every
client_connection_check_interval and may emit a "still waiting" message
each time if log_lock_waits is enabled. One idea to mitigate this is to
add a flag that tracks whether the "still waiting" message has already
been emitted during the current call to ProcSleep(), and to suppress
further messages once it has been logged.
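
To illustrate the flag idea, here is a rough standalone sketch. It is a
toy model, not the actual ProcSleep() code: the function
simulate_lock_wait(), its parameters, and the assumption that the backend
wakes up only on the periodic check interval are all hypothetical
simplifications made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of one lock wait (hypothetical; NOT PostgreSQL source).
 *
 * total_wait_ms        how long the backend stays blocked on the lock
 * deadlock_timeout_ms  wait time after which "still waiting" may be logged
 * check_interval_ms    periodic wakeup, standing in for
 *                      client_connection_check_interval
 * suppress_repeats     if true, log "still waiting" at most once per wait
 *
 * Returns the number of "still waiting" messages emitted.
 */
int
simulate_lock_wait(int total_wait_ms, int deadlock_timeout_ms,
                   int check_interval_ms, bool suppress_repeats)
{
    bool already_logged = false;    /* the proposed per-wait flag */
    int  messages = 0;

    assert(check_interval_ms > 0);

    /* Each loop iteration models one periodic wakeup of the backend. */
    for (int elapsed_ms = check_interval_ms;
         elapsed_ms <= total_wait_ms;
         elapsed_ms += check_interval_ms)
    {
        if (elapsed_ms < deadlock_timeout_ms)
            continue;               /* too early to report anything */

        if (suppress_repeats && already_logged)
            continue;               /* already reported during this wait */

        printf("LOG: still waiting after %d ms\n", elapsed_ms);
        already_logged = true;
        messages++;
    }

    return messages;
}
```

In this model, a 10-second wait with a 1s deadlock timeout and a 2s check
interval wakes up five times. Calling
simulate_lock_wait(10000, 1000, 2000, false) logs on every wakeup, while
passing true for suppress_repeats logs only on the first one.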

Regards,

--
Fujii Masao


