Re: connections not getting closed on a replica - Mailing list pgsql-general

From Carlo Cabanilla
Subject Re: connections not getting closed on a replica
Date
Msg-id CACiJR+WatQk-yuv44ScUKSqn+D4YdtojV8N=_b3kp8GPu7AHhQ@mail.gmail.com
In response to Re: connections not getting closed on a replica  (Kevin Grittner <kgrittn@gmail.com>)
Responses Re: connections not getting closed on a replica  (Kevin Grittner <kgrittn@gmail.com>)
List pgsql-general
Thanks for the reply, Kevin.
 
>> I'm trying to figure out why we had a build up of connections on
>> our streaming replica.
>
> Seriously, from the data provided, about all I can say is "because
> you were opening them faster than you were closing them".  You
> don't say how many cores or how much RAM you had, but allowing 2000
> connections is rarely a good idea.
>
> http://tbeitr.blogspot.com/2015/11/for-better-service-please-take-number.html


The replica is an i2.4xlarge EC2 instance: 122 GB RAM, 16 cores, 53 "compute units", and 4 local SSDs in RAID 10. effective_cache_size is 89GB and work_mem is 128MB. Applications connect through pgbouncer with a default pool size of 650, and the steady state is 500-600 server connections for 1000-1500 clients. Normally pgbouncer limits the connections pretty well; what surprised me most is that postgres wasn't timing connections out to keep the count down.
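For reference, the relevant pgbouncer knobs look roughly like this (an illustrative sketch, not the exact config from this setup; pool_mode and the max_client_conn value in particular are assumptions):

[pgbouncer]
; assumption: the actual pool_mode isn't stated in this thread
pool_mode = transaction
; how many client connections pgbouncer will accept in total (illustrative value)
max_client_conn = 2000
; server connections per database/user pair
default_pool_size = 650
; seconds before pgbouncer closes an idle server connection (600 is the default)
server_idle_timeout = 600

Note that postgres 9.3 itself has no idle-connection timeout (the closest server-side knobs are the tcp_keepalives_* GUCs, which only detect dead clients), so closing idle server connections is entirely pgbouncer's job via server_idle_timeout.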

Also, pgbouncer wasn't reporting an increase in server connections even though they were showing up in postgres directly, so my guess is that pgbouncer considered those connections already closed while postgres couldn't actually close them.
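Concretely, the mismatch is between what these two report (hostnames and port are placeholders, and the admin console needs a user listed in pgbouncer's admin_users/stats_users):

# connections postgres on the replica actually sees
psql -h replica-host -c "SELECT count(*) FROM pg_stat_activity;"

# connections pgbouncer believes it has open to the server
psql -h pgbouncer-host -p 6432 -U stats_user pgbouncer -c "SHOW SERVERS;"
psql -h pgbouncer-host -p 6432 -U stats_user pgbouncer -c "SHOW POOLS;"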

 
>> We're running postgres 9.3.5 on the master and 9.3.10 on
>> the replica,
>
> Because of data loss bugs in 9.3.5 which are fixed in 9.3.10, I
> recommend you update the master ASAP.


Yep, we're working on it.
 
>> linux 3.2.0 on both,
>
> There are scheduler problems in kernels at that level; I recommend
> upgrading linux to a 3.8 or later kernel; 3.13 or later is best, if
> possible.


Ditto.
 
> I don't suppose you have vmstat 1 output from the incident?  If it
> happens again, try to capture that.


Are you looking for a stat in particular?

This is the idle CPU of each core during the first incident:

[attached graph: per-core CPU idle, not reproduced here]
The dark blue at the bottom is core0, which I believe is the one pgbouncer is using. There's a drop when the connections start running away, but there's still some headroom there, and the other cores are pretty much unaffected until the restart.
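For next time, a simple way to keep timestamped vmstat output from an incident window would be something like this (the -t timestamp column needs a procps-ng vmstat; if yours doesn't have it, drop -t and note the start time):

# 1-second samples for 10 minutes, written to a timestamped file
vmstat -t 1 600 > /var/tmp/vmstat-$(date +%Y%m%d-%H%M%S).log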

 
> Have you tuned the OS vm.dirty_* settings?  How about the
> transparent_hugepage settings?


I think the vm dirty settings are at their defaults (a way to check the transparent_hugepage side is sketched after the list):

vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
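For the transparent_hugepage half of the question, and the ratio-based dirty settings not shown above, a quick check looks like this (the sysfs path assumes a mainline-style kernel; some distro kernels put THP elsewhere):

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
sysctl vm.dirty_background_ratio vm.dirty_ratio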


> If this is a NUMA environment, what does this show?:
>
> numactl --hardware

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 124999 MB
node 0 free: 740 MB
node distances:
node   0 
  0:  10 

 

Carlo Cabanilla
Datadog
