Thread: connections not getting closed on a replica
Hello,
I'm trying to figure out why we had a build-up of connections on our streaming replica. We're running postgres 9.3.5 on the master and 9.3.10 on the replica, linux 3.2.0 on both, and disks in RAID 10. Here are some graphs to illustrate what happened:
There were three separate incidents, as shown by the spikes in the server connections graph. During each incident the replica was still serving some requests, but much more slowly, and connections kept building up. Neither pg_cancel_backend(), pg_terminate_backend(), nor the server timeout of 5s was closing them. Eventually we had to restart the database to get back to normal, which isn't ideal.
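For reference, this is roughly the kind of thing we were running to try to clear them (illustrative; not the exact statements from the incident):

    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE pid <> pg_backend_pid()
      AND now() - query_start > interval '5 seconds';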
The second graph shows the rows updated and HOT-updated on the master (n_tup_upd and n_tup_hot_upd from pg_stat_user_tables) for a large, frequently updated table (27M rows, 20GB). It looks like HOT updates start going down right before the connections run away. Going through the code, I found this comment: https://github.com/postgres/postgres/blob/REL9_3_5/src/backend/access/heap/heapam.c#L3031-L3041
From my understanding, that means a HOT update takes a weaker lock, but if a column has to be detoasted the update uses the stronger lock instead. I'm not sure exactly what detoasting involves, but I do know that table has a TOAST table because it uses an hstore column. We can store pretty large values in that hstore column, so I wouldn't be surprised if that was causing the issue.
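To sanity-check that, something like this should confirm the TOAST relation and its size ('big_table' below is just a placeholder for the real table name):

    SELECT c.relname,
           t.relname AS toast_table,
           pg_size_pretty(pg_total_relation_size(t.oid)) AS toast_size
    FROM pg_class c
    JOIN pg_class t ON t.oid = c.reltoastrelid
    WHERE c.relname = 'big_table';  -- placeholder name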
The third graph shows toast block cache hits (toast_blks_hit from pg_statio_user_tables); the hits on the replica (purple) took a nosedive after the first incident and have pretty much never recovered. Does that mean the toast table is too large to fit in cache?
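One way to check that directly would be the per-table hit ratio on the replica, something like:

    SELECT relname,
           toast_blks_hit,
           toast_blks_read,
           round(100.0 * toast_blks_hit
                 / nullif(toast_blks_hit + toast_blks_read, 0), 1) AS toast_hit_pct
    FROM pg_statio_user_tables
    ORDER BY toast_blks_read DESC
    LIMIT 10;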
The fourth graph shows the rate of dead row creation on the master, with the large blue area being the table in question. I'm assuming the dead rows are from updates on the hstore column, but I'm not certain.
The fifth graph shows the disk write throughput on the replica, which I assume is from replaying the WAL for updates on that large table? Write throughput was higher than normal, but not maxing out the capacity of the disks. Write latency was minimal.
The sixth graph is the replica lag; the replica only fell behind during the first incident, not the others.
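For reference, the lag is measured on the standby with roughly this (a sketch; not necessarily the exact query our monitoring runs):

    SELECT now() - pg_last_xact_replay_timestamp() AS approx_replay_lag;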
My question is: why was the postgres replica unable to close connections during these incidents? And why were reads on the replica so slow at the time? Are there locks that lock out readers when there's a lot of update churn?
Carlo Cabanilla
Datadog
On Thu, Dec 10, 2015 at 5:13 PM, Carlo Cabanilla <carlo@datadoghq.com> wrote:

> I'm trying to figure out why we had a build up of connections on
> our streaming replica.

Seriously, from the data provided, about all I can say is "because
you were opening them faster than you were closing them". You
don't say how many cores or how much RAM you had, but allowing 2000
connections is rarely a good idea.

http://tbeitr.blogspot.com/2015/11/for-better-service-please-take-number.html

> We're running postgres 9.3.5 on the master and 9.3.10 on
> the replica,

Because of data loss bugs in 9.3.5 which are fixed in 9.3.10, I
recommend you update the master ASAP.

> linux 3.2.0 on both,

There are scheduler problems in kernels at that level; I recommend
upgrading linux to a 3.8 or later kernel; 3.13 or later is best, if
possible.

I don't suppose you have vmstat 1 output from the incident? If it
happens again, try to capture that.

Have you tuned the OS vm.dirty_* settings? How about the
transparent_hugepage settings?

If this is a NUMA environment, what does this show?:

numactl --hardware

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for the reply Kevin.
> I'm trying to figure out why we had a build up of connections on
> our streaming replica.
Seriously, from the data provided, about all I can say is "because
you were opening them faster than you were closing them". You
don't say how many cores or how much RAM you had, but allowing 2000
connections is rarely a good idea.
http://tbeitr.blogspot.com/2015/11/for-better-service-please-take-number.html
The replica is an i2.4xl EC2 instance, so 122GB RAM, 16 cores, 53 "compute units", and 4 local-storage SSDs in RAID 10. effective_cache_size is 89GB, work_mem is 128MB. Applications connect via pgbouncer with a default pool size of 650, giving a steady state of 500-600 server connections for 1000-1500 clients. Normally pgbouncer limits the connections pretty well. I was most surprised that postgres wasn't timing connections out to keep them low.
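A quick way to double-check the timeout-related settings I'm referring to (a sketch; adjust the list for whichever timeouts apply):

    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('statement_timeout',
                   'tcp_keepalives_idle',
                   'tcp_keepalives_interval',
                   'tcp_keepalives_count');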
Also, pgbouncer wasn't reporting an increase in the number of server connections even though they were showing up in postgres directly, so I was thinking pgbouncer thought these connections were already closed even though postgres couldn't close them.
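To dig into that, a snapshot like this on the replica during an incident should show what state those backends are in (a sketch):

    SELECT pid, state, waiting,
           now() - state_change AS time_in_state,
           left(query, 60) AS query
    FROM pg_stat_activity
    WHERE pid <> pg_backend_pid()
    ORDER BY state_change;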
> We're running postgres 9.3.5 on the master and 9.3.10 on
> the replica,
Because of data loss bugs in 9.3.5 which are fixed in 9.3.10, I
recommend you update the master ASAP.
Yep, we're working on it.
> linux 3.2.0 on both,
There are scheduler problems in kernels at that level; I recommend
upgrading linux to a 3.8 or later kernel; 3.13 or later is best, if
possible.
Ditto.
I don't suppose you have vmstat 1 output from the incident? If it
happens again, try to capture that.
Are you looking for a stat in particular?
This is the idle percentage of each core during the first incident:
The dark blue at the bottom is core0, which I believe is the one pgbouncer is using. There's a drop when the connections start running away, but there's still some headroom there, and the other cores are pretty much unaffected until the restart.
Have you tuned the OS vm.dirty_* settings? How about the
transparent_hugepage settings?
I think the vm dirty settings are default:
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
If this is a NUMA environment, what does this show?:
numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 124999 MB
node 0 free: 740 MB
node distances:
node 0
0: 10
Carlo Cabanilla
Datadog
On Fri, Dec 11, 2015 at 3:37 PM, Carlo Cabanilla <carlo@datadoghq.com> wrote:

> 16 cores
> a default pool size of 650, steady state of 500-600 server
> connections

With so many more connections than resources to serve them, one
thing that can happen is that just by happenstance enough processes
become busy at one time that they start context switching a lot
before they finish, leaving spinlocks blocked and causing other
contention that slows all query run times. This causes bloat to
increase because some database transactions are left active for
longer times. If the client software and/or pooler don't queue
requests at that point, there will be more connections made because
connections have not been freed because of the contention causing
slowness -- which exacerbates the problem and leads to a downward
spiral. That can become so bad that there is no recovery until
either the client software is stopped or the database is restarted.

>> I don't suppose you have vmstat 1 output from the incident? If
>> it happens again, try to capture that.
>
> Are you looking for a stat in particular?

Not really; what I like about `vmstat 1` is how many useful pieces
of information are on each line, allowing me to get a good overview
of what's going on. For example, if system CPU time is high, it is
very likely to be a problem with transparent huge pages, which is
one thing that can cause these symptoms. A "write glut" can also do
so, which can be controlled by adjusting checkpoint and background
writer settings, plus the OS vm.dirty_* settings (and maybe keeping
shared_buffers smaller than you otherwise might). NUMA problems are
not at issue, since there is only one memory node.

Without more evidence of what is causing the problem, suggestions
for a solution are shots in the dark.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Assuming you have at least 16GB of memory, these numbers are not a real problem on a good hardware server. On a bad server motherboard you might as well use a standard PC. With 32GB I have tested ten times more connections, though not against PostgreSQL. I would investigate everything from the bottom up. Also, under TCP/IP the flow and validity of the transaction is guaranteed, so I would look for other issues that are locking the system. For a good motherboard design, check out Intel's motherboards.