Re: client_connection_check_interval default value - Mailing list pgsql-hackers
| From | Fujii Masao |
|---|---|
| Subject | Re: client_connection_check_interval default value |
| Date | |
| Msg-id | CAHGQGwHZUmg+r4kMcPYt_Z-txxVX+CJJhfra+qemxKXvAxYbpw@mail.gmail.com |
| In response to | Re: client_connection_check_interval default value (Jeremy Schneider <schneider@ardentperf.com>) |
| Responses | Re: client_connection_check_interval default value |
| List | pgsql-hackers |
On Fri, Feb 6, 2026 at 8:05 AM Jeremy Schneider <schneider@ardentperf.com> wrote:
>
> One interesting thing to me - it seems like all of the past mail
> threads were focused on a situation different from mine. Lots of
> discussion about freeing resources like CPU.
>
> In the outage I saw, the system was idle and we completely ran out of
> max_connections because all sessions were waiting on a row lock.
>
> Importantly, the app was closing these conns but we had sockets stacking
> up on the server in CLOSE-WAIT state - and postgres simply never
> cleaned them up until we had an outage. The processes were completely
> idle, waiting for a row lock that was not going to be released.
>
> Impact could have been isolated to sessions hitting that row (with this
> GUC), but it escalated to a system outage. It's pretty simple to
> reproduce this:
> https://github.com/ardentperf/pg-idle-test/tree/main/conn_exhaustion
>
>
> On Thu, 5 Feb 2026 09:26:34 -0800
> Jacob Champion <jacob.champion@enterprisedb.com> wrote:
>
> > On Wed, Feb 4, 2026 at 9:30 PM Jeremy Schneider
> > <schneider@ardentperf.com> wrote:
> > > While a fix has been merged in pgx for the most direct root cause of
> > > the incident I saw, this setting just seems like a good behavior to
> > > make Postgres more robust in general.
> >
> > At the risk of making perfect the enemy of better, the protocol-level
> > heartbeat mentioned in the original thread [1] would cover more use
> > cases, which might give it a better chance of eventually becoming
> > default behavior. It might also be a lot of work, though.
>
> It seems like a fair bit of discussion is around OS coverage - even
> Thomas' message there references keepalive working as expected on
> Linux. Tom objects in 2023 that "the default behavior would then be
> platform-dependent and that's a documentation problem we could do
> without."
>
> But it's been five years - has there been further work on implementing
> a postgres-level heartbeat? And I see other places in the docs where we
> note platform differences, so is it really such a big problem to change
> the default here?
>
>
> On Thu, 5 Feb 2026 10:00:29 -0500
> Greg Sabino Mullane <htamfids@gmail.com> wrote:
>
> > I'm a weak -1 on this. Certainly not 2s! That's a lot of context
> > switching for a busy system for no real reason. Also see this past
> > discussion:
>
> In the other thread I see larger perf concerns with some early
> implementations, before they refactored the patch? Konstantin's message
> on 2019-08-02 said he didn't see much difference, and the value of the
> timeout didn't seem to matter; if anything the marginal effect was
> simply from the presence of any timer (same effect as setting
> statement_timeout) - and later in the thread it seems like Thomas also
> saw minimal performance concern here.
>
> I did see a real system outage that could have been prevented by an
> appropriate default value here, since I didn't yet know to change it.

I'm not sure that client_connection_check_interval needs to be enabled by
default. However, if we do agree to change the default and apply it, I think
we should first address a related issue: with log_lock_waits enabled by
default, setting client_connection_check_interval to 2s would cause "still
waiting" messages to be logged every 2 seconds while waiting on a lock. That
could result in a lot of noisy logging under default settings.
The issue is that backends blocked in ProcSleep() are woken up every
client_connection_check_interval and may emit a "still waiting" message on
each wakeup if log_lock_waits is enabled. To mitigate this, one idea is to
add a flag that tracks whether the "still waiting" message has already been
emitted during a call to ProcSleep(), and to suppress further messages once
it has been logged.

Regards,

--
Fujii Masao
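To make the suppression-flag idea above concrete, here is a minimal,
self-contained C sketch (not PostgreSQL source: the loop shape, the
one-second interval, and all names such as `proc_sleep_sketch`,
`logged_still_waiting`, and `lock_released` are hypothetical, standing in
for the periodic wakeup driven by client_connection_check_interval):

```c
/*
 * Standalone sketch of the suppression idea from this thread.
 * NOT PostgreSQL code: all names and the loop shape are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool
lock_released(int iteration)
{
    /* Pretend the blocking lock is released on the fifth wakeup. */
    return iteration >= 5;
}

static void
proc_sleep_sketch(bool log_lock_waits)
{
    bool logged_still_waiting = false;  /* proposed one-shot flag */
    int  iteration = 0;

    for (;;)
    {
        /*
         * Stand-in for the periodic wakeup driven by
         * client_connection_check_interval (here: 1 second).
         */
        sleep(1);
        iteration++;

        if (lock_released(iteration))
            break;

        /* The connection-liveness check would happen here. */

        if (log_lock_waits && !logged_still_waiting)
        {
            printf("LOG: still waiting for lock (wakeup %d)\n", iteration);
            logged_still_waiting = true;  /* suppress repeats this wait */
        }
    }

    printf("lock acquired after %d wakeups\n", iteration);
}

int
main(void)
{
    proc_sleep_sketch(true);
    return 0;
}
```

In actual PostgreSQL code the flag would presumably be local state of a
single ProcSleep() call, reset for each new lock wait, so each wait still
produces exactly one "still waiting" report rather than one per wakeup.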