Thread: Sudden connection and load average spikes with postgresql 9.3
Hello all,

This is my very first message to the PostgreSQL community, and I really hope you can help me solve the trouble I'm facing.

I have an 80-core server (multithreaded) with close to 500GB RAM. My configuration is:

max_connections: 1500 (was 850)
shared_buffers: 188GB
work_mem: 110MB (was 220MB)
maintenance_work_mem: 256MB
effective_cache_size: 340GB

The database is running under PostgreSQL 9.3.9 on Ubuntu Server 14.04 LTS (kernel 3.13.0-55-generic).

For two days now I've been experiencing that, randomly, the connections rise up until they reach max_connections, and the load average of the server goes to around 300~400, making every command issued on the server take forever. When this happens, RAM usage is relatively low (70GB used), core activity is lower than usual, and sometimes swapping occurs (I have swappiness set to 10).

I've been trying to find the cause of this underperformance, even logging all queries in debug mode, but nothing strange found so far. I really don't know what my next step should be to isolate the problem, and that's why I'm writing to you guys. Have you ever seen this behaviour before? Could you kindly suggest any steps to follow?

Thanks and best regards!
E.v.

--
View this message in context: http://postgresql.nabble.com/Sudden-connection-and-load-average-spikes-with-postgresql-9-3-tp5855895.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
eudald_v <reaven.galaeindael@gmail.com> writes:
> This is my very first message to the Postgresql community, and I really hope
> you can help me solve the trouble I'm facing.
> I've an 80 core server (multithread) with close to 500GB RAM.
> My configuration is:
> MaxConn: 1500 (was 850)
> Shared buffers: 188Gb
> work_mem: 110Mb (was 220Mb)
> maintenance_work_mem: 256Mb
> effective_cache_size: 340Gb
> The database is running under postgresql 9.3.9 on an Ubuntu Server 14.04 LTS
> (build 3.13.0-55-generic)
> Two days from now, I've been experiencing that, randomly, the connections
> rise up till they reach max connections, and the load average of the server
> goes arround 300~400, making every command issued on the server take
> forever.

You haven't described why you would suddenly be getting more connections, but if that's just an expected byproduct of transactions taking too long to finish, then a plausible theory is that something is acquiring a strong lock on some heavily-used table, causing other transactions to back up behind it. The pg_stat_activity and pg_locks views may help identify the culprit.

In any case, a good mitigation plan would be to back off your max_connections setting, and instead use a connection pooler to manage large numbers of clients. The usual rule of thumb is that you don't want the number of backends to be much more than the number of cores. 1500 active sessions is WAY more than your hardware can support, and even approaching that will exacerbate whatever the root performance problem is.

I'm also a bit worried by your focus on CPU capacity without any mention of the disk subsystem. It may well be that the actual bottleneck is disk throughput, and that you can't even sustain 80 sessions unless they're read-mostly. It would be a good idea to watch the I/O stats the next time you see one of these high-load episodes.

It might also be a good idea to decrease the shared_buffers setting and rely more on OS-level disk caching. Some have found that very large shared buffer pools tend to increase contention without much payback.

			regards, tom lane
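[A query along these lines can pair waiting sessions with the sessions holding the conflicting lock. This is a sketch for 9.3, which predates pg_blocking_pids(); it is one common pg_locks self-join pattern, not the exact script Tom had in mind.]

```sql
-- Sketch: match each ungranted lock request against the granted lock
-- on the same object held by a different backend (PostgreSQL 9.3).
SELECT blocked.pid        AS blocked_pid,
       blocked_act.query  AS blocked_query,
       blocking.pid       AS blocking_pid,
       blocking_act.query AS blocking_query
FROM pg_locks blocked
JOIN pg_locks blocking
  ON  blocking.locktype       = blocked.locktype
  AND blocking.database       IS NOT DISTINCT FROM blocked.database
  AND blocking.relation       IS NOT DISTINCT FROM blocked.relation
  AND blocking.page           IS NOT DISTINCT FROM blocked.page
  AND blocking.tuple          IS NOT DISTINCT FROM blocked.tuple
  AND blocking.transactionid  IS NOT DISTINCT FROM blocked.transactionid
  AND blocking.pid           <> blocked.pid
JOIN pg_stat_activity blocked_act  ON blocked_act.pid  = blocked.pid
JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
WHERE NOT blocked.granted
  AND blocking.granted;
```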
Dear Tom,

Thanks for your fast response.

First of all, yes, queries seem to take more time to process and they get queued up (you can even see INSERTs with status "waiting" in top/htop).

I didn't know about that connection tip, and I will absolutely find a moment to add a connection pooler to reduce the number of active/idle connections shown at the database.

I'm monitoring the pg_stat_activity view, and when connections rise there's no query being processed that could block a table and cause the other queries to stack up waiting. I'm checking the pg_locks view too, with results yet to be determined.

The spike has happened once again (it's happening with more frequency), and I was able to catch this I/O log. These are 24 300GB SAS disks in RAID 10 with hot spare.

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       42,00    136,00     96,00      136      96
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          14,82  0,00   52,75    0,61     0,00    31,82

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       598,00   1764,00    15188,00   1764     15188
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          17,79  0,00   82,19    0,01     0,00    0,01

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       230,00   28,00      6172,00    28       6172
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          26,79  0,00   72,49    0,12     0,00    0,60

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       200,00   96,00      8364,00    96       8364
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          17,88  0,00   81,95    0,01     0,00    0,15

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       87,00    48,00      4164,00    48       4164
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          26,09  0,00   66,86    1,69     0,00    5,36

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       563,00   1932,00    9004,00    1932     9004
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          19,30  0,00   80,38    0,10     0,00    0,21

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       159,00   348,00     5292,00    348      5292
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          28,15  0,00   68,03    0,19     0,00    3,63

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       250,00   196,00     7388,00    196      7388
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          11,03  0,00   38,68    0,14     0,00    50,16

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       294,00   1508,00    2916,00    1508     2916

Hope it helps. If you feel we should reduce shared_buffers, what would be a nice value?

Thanks!
On 06/30/2015 07:52 AM, eudald_v wrote:
> Two days from now, I've been experiencing that, randomly, the connections
> rise up till they reach max connections, and the load average of the server
> goes arround 300~400, making every command issued on the server take
> forever. When this happens, ram is relatively low (70Gb used), cores
> activity is lower than usual and sometimes swap happens (I've swappiness
> configured to 10%)

As Tom said, the most likely reason for this is application behavior and blocking locks. Try some of the queries on our scripts page:

https://github.com/pgexperts/pgx_scripts/tree/master/locks

However, I have seen some other things which cause these kinds of stalls:

* runaway connection generation by the application, due to either a programming bug or an irresponsible web crawler (see https://www.pgexperts.com/blog/quinn_weaver/)
* issues evicting blocks from shared_buffers: what is your shared_buffers set to? How large is your database?
* checkpoint stalls: what FS are you on? What are your transaction log settings for PostgreSQL?
* issues with the software/hardware stack around your storage, causing total IO stalls periodically. What does IO throughput look like before/during/after the stalls?

The last was the cause the last time I dealt with a situation like yours; it turned out the issue was bad RAID card firmware where the card would lock up whenever the write-through buffer got too much pressure.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
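[The transaction-log settings Josh asks about can be read out of a psql session. A sketch for 9.3, where checkpoint_segments still exists; it was replaced by max_wal_size/min_wal_size in 9.5.]

```sql
-- Inspect checkpoint / WAL configuration on a 9.3 server:
SHOW checkpoint_segments;
SHOW checkpoint_timeout;
SHOW checkpoint_completion_target;
SHOW wal_buffers;

-- Or pull the whole group from the catalog at once:
SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE 'checkpoint%' OR name LIKE 'wal%';
```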
Dear Josh,

I'm sorry I didn't write before, but we have been very busy with this issue and, you know, when something goes wrong, the apocalypse comes with it.

I've been working on everything you suggested. I used your tables and script and I can give you a sample of the output, ordered by locked_query_start:

2015-07-02 14:49:45.972129+02 | 15314 | | 4001 | | "TABLE_Z" | tuple | ExclusiveLock | 24018:24 | relation | ShareUpdateExclusiveLock | | | YYY.YYY.YYY.YYY/32 |
  2015-07-02 14:49:26.635599+02 | 2015-07-02 14:49:26.635599+02 | 2015-07-02 14:49:26.635601+02 |
  INSERT INTO "TABLE_X" ("field1", "field2", "field3", "field4", "field5", "field6", "field7") VALUES (22359509, 92, 5, 88713, 'XXX.XXX.XXX.XXX', 199, 10) |
  | | 2015-07-02 14:11:45.368709+02 | 2015-07-02 14:11:45.368709+02 | active | 2015-07-02 14:11:45.36871+02 |
  autovacuum: VACUUM ANALYZE public.TABLE_Z

2015-07-02 14:49:45.972129+02 | 15857 | | 4001 | | "TABLE_Z" | tuple | ExclusiveLock | 24018:24 | relation | ShareUpdateExclusiveLock | | | YYY.YYY.YYY.YYY/32 |
  2015-07-02 14:49:22.79166+02 | 2015-07-02 14:49:22.79166+02 | 2015-07-02 14:49:22.791665+02 |
  INSERT INTO "TABLE_X" ("field1", "field2", "field3", "field4", "field5", "field6", "field7") VALUES (14515978, 92, 5, 88713, 'XXX.XXX.XXX.XXX', 199, 10) |
  | | 2015-07-02 14:11:45.368709+02 | 2015-07-02 14:11:45.368709+02 | active | 2015-07-02 14:11:45.36871+02 |
  autovacuum: VACUUM ANALYZE public.TABLE_Z

2015-07-02 14:49:45.972129+02 | 15314 | | 14712 | | "TABLE_Z" | tuple | ExclusiveLock | 24018:24 | relation | AccessShareLock | | | 1YYY.YYY.YYY.YYY/32 |
  2015-07-02 14:49:26.635599+02 | 2015-07-02 14:49:26.635599+02 | 2015-07-02 14:49:26.635601+02 |
  INSERT INTO "TABLE_X" ("field1", "field2", "field3", "field4", "field5", "field6", "field") VALUES (22359509, 92, 5, 88713, 'XXX.XXX.XXX.XXX', 199, 10) |
  | 185.10.253.72/32 | 2015-07-02 14:48:48.841375+02 | 2015-07-02 14:48:48.841375+02 | active | 2015-07-02 14:48:48.841384+02 |
  INSERT INTO "TABLE_Y" ("email_id", "sendout_id", "feed_id", "isp_id") VALUES (46015879, 75471, 419, 0)

All that was recorded during a spike. From this log I have to point out something: tables TABLE_X and TABLE_Y both have a TRIGGER that does an INSERT into TABLE_Z. As you can see, TABLE_Z was being VACUUM ANALYZEd at the time. I wonder if triggers and vacuum work well together, just to check another perspective.

We have also looked carefully at our scripts and performed some code optimizations (like closing db connections earlier), but the spikes continue to happen.

The FS is ext4, and I don't know how to check the transaction log configuration.

This is how I/O looks before any problem happens:

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       357,00   7468,00    8840,00    7468     8840
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          5,02   0,00   2,44     0,06     0,00    92,47

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       471,00   7032,00    13760,00   7032     13760
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          5,14   0,00   2,92     0,03     0,00    91,92

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       376,00   7192,00    8048,00    7192     8048
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          4,77   0,00   2,57     0,03     0,00    92,63

Device:   tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda       304,00   7280,00    8252,00    7280     8252

And this is how it looks when the spike happens: http://pastebin.com/2hAYuDZ5

Hope it helps determine what's happening. Thanks for all your efforts and collaboration!

Eudald
On 07/02/2015 08:41 AM, eudald_v wrote:
> All that was recorded during a spike. From this log I have to point
> something:
> Tables TABLE_X and TABLE_Y have both a TRIGGER that does an INSERT to
> TABLE_Z
> As you can see, TABLE_Z was being VACUUM ANALYZED. I wonder if TRIGGERS and
> VACUUM work well together, just to check another perspective.

Well, it's not triggers in particular, but vacuum does create some contention and possible sluggishness. Questions:

* What kind of writes do the triggers do?
* Can they conflict between sessions? That is, are different writes on X and/or Y possibly overwriting the same rows on Z?
* Is that autovacuum a regular autovacuum, or is it "to prevent wraparound"?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
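[Josh's last question can be checked directly while a spike is in progress: an anti-wraparound autovacuum advertises itself in pg_stat_activity. A small sketch:]

```sql
-- A wraparound-prevention autovacuum shows "(to prevent wraparound)"
-- appended to its query text; a regular one does not.
SELECT pid, query_start, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';
```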
On 07/02/2015 08:41 AM, eudald_v wrote:
> And this is how it looks like when the spike happens:
> http://pastebin.com/2hAYuDZ5

Hmm, those incredibly high system % figures indicate that there's something wrong with your system. If you're not using software RAID or ZFS, you should never see that. I think you have a driver, kernel, Linux memory management, or IO stack issue.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> wrote:
> On 07/02/2015 08:41 AM, eudald_v wrote:
>> And this is how it looks like when the spike happens:
>> [system CPU staying over 90%]

> I think you have a driver, kernel, Linux memory management, or IO
> stack issue.

In my experience this is usually caused by failure to disable transparent huge page support. The larger shared_buffers is configured, the bigger the problem.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
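[The THP state Kevin refers to can be checked and changed from the shell. A sketch for a 3.13-era Ubuntu kernel; paths and the exact set of options can vary by distribution, and the runtime change does not survive a reboot unless persisted (e.g. via rc.local or the transparent_hugepage=never kernel parameter).]

```shell
# Check the current state; the bracketed value is the active one,
# e.g. "[always] madvise never"
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Disable at runtime (as root)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```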
On Tue, Jun 30, 2015 at 8:52 AM, eudald_v <reaven.galaeindael@gmail.com> wrote:
> Hello all,
> This is my very first message to the Postgresql community, and I really hope
> you can help me solve the trouble I'm facing.
>
> I've an 80 core server (multithread) with close to 500GB RAM.
>
> My configuration is:
> MaxConn: 1500 (was 850)

Drop this to 80 to 160, and use pgbouncer for pooling. pgpool is nice, but it's a lot harder to configure; pgbouncer takes literally a couple of minutes and you're up and running.

> Shared buffers: 188Gb

I have yet to see shared_buffers this big be a big help. I'd drop it down to 1 to 16GB or so, but that's just me.

> work_mem: 110Mb (was 220Mb)

110MB * 1500 connections * 1, 2, or 3 sorts per query == disaster. Drop this down as well. If you have ONE query that needs a lot, create a user for that query, ALTER that user with a bigger work_mem, and use it only for that big query.

> maintenance_work_mem: 256Mb
> effective_cache_size: 340Gb
>
> The database is running under postgresql 9.3.9 on an Ubuntu Server 14.04 LTS
> (build 3.13.0-55-generic)

Good kernel.

OK, so you've got 80 cores. Have you checked zone_reclaim_mode? Under no circumstances should that ever be turned on on a database server. If "sysctl -a | grep zone" returns:

vm.zone_reclaim_mode = 1

then use /etc/sysctl.conf to set it to 0. It's a silent killer that will make your machine run REALLY slowly for periods of 10 to 30 minutes for no apparent reason.

But first and foremost, GET A POOLER in the loop NOW. Not having one makes your machine extremely vulnerable to overload, and makes it impossible for you to manage.
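[A minimal pgbouncer setup along the lines Scott suggests might look like the following. The database name, paths, port, and pool size here are illustrative assumptions, not values from this thread.]

```ini
; /etc/pgbouncer/pgbouncer.ini -- illustrative sketch
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 80      ; roughly one server backend per core
max_client_conn = 1500      ; excess clients queue here, not in postgres
```

With this in place the application connects to port 6432, and PostgreSQL only ever sees up to default_pool_size backends per database/user pair.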
Hello guys!

I finally got rid of it. It looks like, in the end, it was all due to the transparent_hugepage setting. I disabled it and the CPU spikes disappeared.

I'm sorry, because it's something I usually disable on PostgreSQL servers, but I forgot to do so on this one and never thought about it.

Thanks a lot for all your helpful messages!

Eudald
Note that if you still have the settings you showed in your original post, you're just moving the goal posts a few feet further back. Any heavy load can still trigger this kind of behaviour.

On Tue, Jul 7, 2015 at 5:29 AM, eudald_v <reaven.galaeindael@gmail.com> wrote:
> Hello guys!
>
> I finally got rid of it.
> It looks that at the end it was all due to transparent_hugepages values.
>
> I disabled them and cpu spikes disappeared. I am sorry cause it's something
> I usually disable on postgresql servers, but I forgot to do so on this one
> and never thought about it.
>
> Thanks a lot for all your helpful messages!
>
> Eudald

--
To understand recursion, one must first understand recursion.