Thread: BUG #18141: sorry, too many clients error occurring very frequently

BUG #18141: sorry, too many clients error occurring very frequently

From

PG Bug reporting form

Date:

29 September 2023, 14:31:55

The following bug has been logged on the website:

Bug reference:      18141
Logged by:          Ravi Agrawal
Email address:      ravi.s.agrawal23@gmail.com
PostgreSQL version: 13.11
Operating system:   RHEL
Description:

Hi Team,

We have updated PostgresSQL from version 13.4 to 13.11 and observing 'sorry,
too many clients' error very frequently.
Connection count is shooting up on performing basic operations  and crossing
the max_connections value. This has not been observed previously on Postgres
v13.4 
We need to understand if this issue is caused due to the change in version.
Is there any history of known incidents facing this issue with the specific
version of PostgreSQL.

Appreciate your response on this.

Many Thanks,
Ravi Agrawal

BUG #18141: sorry, too many clients error occurring very frequently

From

"Wetmore, Matthew (CTR)"

Date:

29 September 2023, 18:12:15

I can second this behavior.  We did a few things to mitigate the issue.  HA may be at play here too.
We're still monitoring.

13.11 + HA - these issues
13.6 , no HA - no issues


-----Original Message-----
From: PG Bug reporting form <noreply@postgresql.org> 
Sent: Friday, September 29, 2023 7:32 AM
To: pgsql-bugs@lists.postgresql.org
Cc: ravi.s.agrawal23@gmail.com
Subject: [EXTERNAL] BUG #18141: sorry, too many clients error occurring very frequently

The following bug has been logged on the website:

Bug reference:      18141
Logged by:          Ravi Agrawal
Email address:      ravi.s.agrawal23@gmail.com
PostgreSQL version: 13.11
Operating system:   RHEL
Description:        

Hi Team,

We have updated PostgresSQL from version 13.4 to 13.11 and observing 'sorry, too many clients' error very frequently.
Connection count is shooting up on performing basic operations  and crossing the max_connections value. This has not
beenobserved previously on Postgres
 
v13.4
We need to understand if this issue is caused due to the change in version.
Is there any history of known incidents facing this issue with the specific version of PostgreSQL.

Appreciate your response on this.

Many Thanks,
Ravi Agrawal

Re: BUG #18141: sorry, too many clients error occurring very frequently

From

Joe Conway

Date:

29 September 2023, 18:12:19

On 9/29/23 10:31, PG Bug reporting form wrote:
> We have updated PostgresSQL from version 13.4 to 13.11 and observing 'sorry,
> too many clients' error very frequently.
> Connection count is shooting up on performing basic operations  and crossing
> the max_connections value. This has not been observed previously on Postgres
> v13.4
> We need to understand if this issue is caused due to the change in version.
> Is there any history of known incidents facing this issue with the specific
> version of PostgreSQL.


Question -- are you using pg_stat_statements? If so, please try 
disabling it and see if the issue goes away. Also was is your work_mem 
setting? Does raising it help?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: BUG #18141: sorry, too many clients error occurring very frequently

From

Jeff Janes

Date:

30 September 2023, 18:04:56

On Fri, Sep 29, 2023 at 7:55 PM PG Bug reporting form <noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 18141
Logged by: Ravi Agrawal
Email address: ravi.s.agrawal23@gmail.com
PostgreSQL version: 13.11
Operating system: RHEL
Description:

Hi Team,

We have updated PostgresSQL from version 13.4 to 13.11 and observing 'sorry,
too many clients' error very frequently.
Connection count is shooting up on performing basic operations and crossing
the max_connections value. This has not been observed previously on Postgres
v13.4
We need to understand if this issue is caused due to the change in version.
Is there any history of known incidents facing this issue with the specific
version of PostgreSQL.

I think it is unlikely to be a bug. How did you do the upgrade? Just install the new binaries, then restart the server?

Can you capture plans from the newly slow queries and compare them to plans from the same queries before the upgrade?

Cheers,

Jeff

Re: BUG #18141: sorry, too many clients error occurring very frequently

From

Joe Conway

Date:

01 October 2023, 14:34:15

On 9/30/23 14:04, Jeff Janes wrote:
> On Fri, Sep 29, 2023 at 7:55 PM PG Bug reporting form 
> <noreply@postgresql.org <mailto:noreply@postgresql.org>> wrote:
> 
>     The following bug has been logged on the website:
> 
>     Bug reference:      18141
>     Logged by:          Ravi Agrawal
>     Email address: ravi.s.agrawal23@gmail.com
>     <mailto:ravi.s.agrawal23@gmail.com>
>     PostgreSQL version: 13.11
>     Operating system:   RHEL
>     Description:
> 
>     Hi Team,
> 
>     We have updated PostgresSQL from version 13.4 to 13.11 and observing
>     'sorry,
>     too many clients' error very frequently.
>     Connection count is shooting up on performing basic operations  and
>     crossing
>     the max_connections value. This has not been observed previously on
>     Postgres
>     v13.4
>     We need to understand if this issue is caused due to the change in
>     version.
>     Is there any history of known incidents facing this issue with the
>     specific
>     version of PostgreSQL.
> 
> 
> I think it is unlikely to be a bug.  How did you do the upgrade?  Just 
> install the new binaries, then restart the server?

Maybe, maybe not. I have seen two other cases that are similar. One was 
an upgrade from 12.8 to 12.12 and the other an upgrade from 13.4 to 13.8.

I checked and 12.8 was stamped is the same date as 13.4, and 12.12 the 
same day as 13.8.

In both cases queries against pg_stat_statements suddenly started taking 
more memory leading to spillage to pg_temp and a step degradation in 
overall performance. In at least one of those cases the 
solution/workaround was to increase work_mem. In the other I think 
pg_stat_statements was disabled.

Myself and at least one other hacker looked at the pg_stat_statements 
specific changes in that time interval and saw no smoking gun.

But it is possible that something else backpatched to both branches 
between Aug 09, 2021 and Aug 8, 2022 has caused a more general 
performance regression which we have yet to track down.

At least based on this sample of two (now maybe 3?) folks with similar 
symptoms.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: BUG #18141: sorry, too many clients error occurring very frequently

From

Tom Lane

Date:

01 October 2023, 18:09:48

Joe Conway <mail@joeconway.com> writes:
> Maybe, maybe not. I have seen two other cases that are similar. One was 
> an upgrade from 12.8 to 12.12 and the other an upgrade from 13.4 to 13.8.

> I checked and 12.8 was stamped is the same date as 13.4, and 12.12 the 
> same day as 13.8.

> In both cases queries against pg_stat_statements suddenly started taking 
> more memory leading to spillage to pg_temp and a step degradation in 
> overall performance. In at least one of those cases the 
> solution/workaround was to increase work_mem. In the other I think 
> pg_stat_statements was disabled.

> Myself and at least one other hacker looked at the pg_stat_statements 
> specific changes in that time interval and saw no smoking gun.

> But it is possible that something else backpatched to both branches 
> between Aug 09, 2021 and Aug 8, 2022 has caused a more general 
> performance regression which we have yet to track down.

Hmm.  My first instinct is to wonder about changes in plan selection.
How complex were the troublesome queries?

            regards, tom lane

Re: BUG #18141: sorry, too many clients error occurring very frequently

From

Joe Conway

Date:

01 October 2023, 18:27:05

On 10/1/23 14:09, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> Maybe, maybe not. I have seen two other cases that are similar. One was 
>> an upgrade from 12.8 to 12.12 and the other an upgrade from 13.4 to 13.8.
> 
>> I checked and 12.8 was stamped is the same date as 13.4, and 12.12 the 
>> same day as 13.8.
> 
>> In both cases queries against pg_stat_statements suddenly started taking 
>> more memory leading to spillage to pg_temp and a step degradation in 
>> overall performance. In at least one of those cases the 
>> solution/workaround was to increase work_mem. In the other I think 
>> pg_stat_statements was disabled.
> 
>> Myself and at least one other hacker looked at the pg_stat_statements 
>> specific changes in that time interval and saw no smoking gun.
> 
>> But it is possible that something else backpatched to both branches 
>> between Aug 09, 2021 and Aug 8, 2022 has caused a more general 
>> performance regression which we have yet to track down.
> 
> Hmm.  My first instinct is to wonder about changes in plan selection.
> How complex were the troublesome queries?

I think both of these cases involved a number of common attributes:

* The queries against pg_stat_statements were
   relatively complex
* The other queries on the system were relatively
   long and complex (and thus the query string length
   in pg_stat_statements)
* Prior to the upgrade the systems were overall
   keeping up, but extremely busy

In one case it seems that the upgrade caused a significant increase of 
temp file usage. This impacted the system enough that other active 
queries took longer, and thus number of active connections increased. 
Raising work_mem eliminated the temp file usage and cpu loads dropped 
back to similar levels as they were prior to the minor upgrade.

The other case had different specifics, but generally involved increased 
memory usage. In that one eliminating the use of pg_stat_statements 
restored performance. They did not try raising work_mem (as I understand 
it), and I did not get any info regarding temp file spillage there.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: BUG #18141: sorry, too many clients error occurring very frequently

From

Tom Lane

Date:

01 October 2023, 19:45:55

Joe Conway <mail@joeconway.com> writes:
> On 10/1/23 14:09, Tom Lane wrote:
>> Joe Conway <mail@joeconway.com> writes:
>>> But it is possible that something else backpatched to both branches 
>>> between Aug 09, 2021 and Aug 8, 2022 has caused a more general 
>>> performance regression which we have yet to track down.

>> Hmm.  My first instinct is to wonder about changes in plan selection.
>> How complex were the troublesome queries?

> I think both of these cases involved a number of common attributes:
> * The queries against pg_stat_statements were
>    relatively complex
> * The other queries on the system were relatively
>    long and complex (and thus the query string length
>    in pg_stat_statements)
> * Prior to the upgrade the systems were overall
>    keeping up, but extremely busy
> In one case it seems that the upgrade caused a significant increase of 
> temp file usage. This impacted the system enough that other active 
> queries took longer, and thus number of active connections increased. 
> Raising work_mem eliminated the temp file usage and cpu loads dropped 
> back to similar levels as they were prior to the minor upgrade.

Interesting.  After some desultory trawling through the commit log,
I'm wondering if there could be some connection to

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: REL_13_STABLE Release: REL_13_5 [dede14399] 2021-09-20 11:48:52 -0400
Branch: REL_12_STABLE Release: REL_12_9 [f230614da] 2021-09-20 11:48:52 -0400
Branch: REL_11_STABLE Release: REL_11_14 [914e54501] 2021-09-20 11:48:52 -0400
Branch: REL_10_STABLE Release: REL_10_19 [923b7efc2] 2021-09-20 11:48:52 -0400
Branch: REL9_6_STABLE Release: REL9_6_24 [183b3aced] 2021-09-20 11:48:52 -0400

    Don't elide casting to typmod -1.

That's in the right timeframe for these reports.  A fairly direct
connection to pg_stat_statements could be made: addition of
RelabelType nodes that weren't there before would change the query ID
hash of affected statements, so that after running for a little while
the pg_stat_statements hash would have a bunch of duplicate entries
(assuming you didn't bump up against pg_stat_statements.max).
As long as nothing happened to age out the now-dead entries with
the old query hashes, you'd have more rows in the pg_stat_statements
view than before, which could explain performance decreases in
queries on that view.

The big hole in this theory is that I would not have expected this
casting change to affect any large proportion of SQL commands,
so that it's a bit hard to credit it causing a lot of bloat in
pg_stat_statements.  Still, some specific coding habit or DDL
detail could maybe allow that to happen in a particular application.

Another line of thought is that the extra RelabelType nodes could
block a planner optimization that used to occur before.  It's not
apparent why that would manifest specifically in connection with
pg_stat_statements queries though.  If that is the problem,
we'd need an example of a query whose plan changed in order to pin
down the cause.

Anyway, that's just a theory, and it might be hot air.

            regards, tom lane