Thread: Sudden connection and load average spikes with postgresql 9.3

Sudden connection and load average spikes with postgresql 9.3

From
eudald_v
Date:
Hello all,
This is my very first message to the Postgresql community, and I really hope
you can help me solve the trouble I'm facing.

I've an 80 core server (multithread) with close to 500GB RAM.

My configuration is:
max_connections: 1500 (was 850)
shared_buffers: 188GB
work_mem: 110MB (was 220MB)
maintenance_work_mem: 256MB
effective_cache_size: 340GB

The database is running under PostgreSQL 9.3.9 on Ubuntu Server 14.04 LTS
(kernel 3.13.0-55-generic).

For the last two days I've been experiencing that, randomly, the connections
rise until they reach max_connections, and the load average of the server
goes to around 300-400, making every command issued on the server take
forever. When this happens, RAM usage is relatively low (70GB used), core
activity is lower than usual, and sometimes swapping occurs (I have
swappiness set to 10).

I've been trying to find the cause of this underperformance, even
logging all queries in debug mode, but I haven't found anything strange so far.

I really don't know what my next step should be to isolate the problem,
and that's why I'm writing to you. Have you ever seen this behaviour
before? Could you kindly suggest any steps to follow?

Thanks and best regards!
E.v.





Re: Sudden connection and load average spikes with postgresql 9.3

From
Tom Lane
Date:
eudald_v <reaven.galaeindael@gmail.com> writes:
> This is my very first message to the Postgresql community, and I really hope
> you can help me solve the trouble I'm facing.

> I've an 80 core server (multithread) with close to 500GB RAM.

> My configuration is:
> MaxConn: 1500 (was 850)
> Shared buffers: 188Gb
> work_mem: 110Mb (was 220Mb)
> maintenance_work_mem: 256Mb
> effective_cache_size: 340Gb

> The database is running under postgresql 9.3.9 on an Ubuntu Server 14.04 LTS
> (build 3.13.0-55-generic)

> Two days from now, I've been experiencing that, randomly, the connections
> rise up till they reach max connections, and the load average of the server
> goes arround 300~400, making every command issued on the server take
> forever. When this happens, ram is relatively low (70Gb used), cores
> activity is lower than usual and sometimes swap happens (I've swappiness
> configured to 10%)

You haven't described why you would suddenly be getting more connections,
but if that's just an expected byproduct of transactions taking too long
to finish, then a plausible theory is that something is acquiring a strong
lock on some heavily-used table, causing other transactions to back up
behind it.  The pg_stat_activity and pg_locks views may help identify
the culprit.
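
If it helps, a query along these lines (a rough sketch of the usual pg_locks
self-join; substitute your database name for "yourdb") will show which waiting
session is stuck behind which lock holder:

psql -d yourdb -c "
SELECT blocked_locks.pid           AS blocked_pid,
       blocked_activity.query      AS blocked_query,
       blocking_locks.pid          AS blocking_pid,
       blocking_activity.query     AS blocking_query
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database      IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation      IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page          IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple         IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid    IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid       IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid         IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid      IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid <> blocked_locks.pid
 AND blocking_locks.granted
JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;"

Run it while the connections are piling up; a pid that keeps showing up in
blocking_pid is the thing to chase.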

In any case, a good mitigation plan would be to back off your
max_connections setting, and instead use a connection pooler to manage
large numbers of clients.  The usual rule of thumb is that you don't want
number of backends much more than the number of cores.  1500 active
sessions is WAY more than your hardware can support, and even approaching
that will exacerbate whatever the root performance problem is.

I'm also a bit worried by your focus on CPU capacity without any mention
of the disk subsystem.  It may well be that the actual bottleneck is disk
throughput, and that you can't even sustain 80 sessions unless they're
read-mostly.  It would be a good idea to watch the i/o stats the next
time you see one of these high-load episodes.
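For example (assuming sysstat is installed), extended per-device stats every
few seconds should show whether the disks are saturating:

iostat -xk 5
# watch %util and await on the device under the data directory during a spike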

It might also be a good idea to decrease the shared_buffers setting and
rely more on OS-level disk caching.  Some have found that very large
shared buffer pools tend to increase contention without much payback.

            regards, tom lane


Re: Sudden connection and load average spikes with postgresql 9.3

From
eudald_v
Date:
Dear Tom,
Thanks for your quick reply.

First of all, yes, queries seem to take more time to process and they appear
to queue up (you can even see INSERTs in waiting status in top/htop).

I didn't know about that connection tip, and I will absolutely find a moment
to add a connection pooler to reduce the number of active/idle connections
shown at the database.

I'm monitoring the pg_stat_activity view, and when connections rise there's
no query being processed that could block a table and cause the other
queries to stack up waiting. I'm checking the pg_locks view too, with results
still to be determined.

In the meantime, it has spiked once again (it's happening with
more frequency). I was able to capture this I/O log. These are 24 300GB
SAS disks in RAID 10 with a hot spare.

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              42,00       136,00        96,00        136         96
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14,82    0,00   52,75    0,61    0,00   31,82
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             598,00      1764,00     15188,00       1764      15188
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17,79    0,00   82,19    0,01    0,00    0,01
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             230,00        28,00      6172,00         28       6172
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          26,79    0,00   72,49    0,12    0,00    0,60
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             200,00        96,00      8364,00         96       8364
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17,88    0,00   81,95    0,01    0,00    0,15
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              87,00        48,00      4164,00         48       4164
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          26,09    0,00   66,86    1,69    0,00    5,36
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             563,00      1932,00      9004,00       1932       9004
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          19,30    0,00   80,38    0,10    0,00    0,21
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             159,00       348,00      5292,00        348       5292
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          28,15    0,00   68,03    0,19    0,00    3,63
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             250,00       196,00      7388,00        196       7388
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,03    0,00   38,68    0,14    0,00   50,16
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             294,00      1508,00      2916,00       1508       2916

Hope it helps. If you feel we should reduce shared_buffers, what would be a
nice value?

Thanks!





Re: Sudden connection and load average spikes with postgresql 9.3

From
Josh Berkus
Date:
On 06/30/2015 07:52 AM, eudald_v wrote:
> Two days from now, I've been experiencing that, randomly, the connections
> rise up till they reach max connections, and the load average of the server
> goes arround 300~400, making every command issued on the server take
> forever. When this happens, ram is relatively low (70Gb used), cores
> activity is lower than usual and sometimes swap happens (I've swappiness
> configured to 10%)

As Tom said, the most likely reason for this is application behavior and
blocking locks.  Try some of these queries on our scripts page:

https://github.com/pgexperts/pgx_scripts/tree/master/locks

However, I have seen some other things which cause these kinds of stalls:

* runaway connection generation by the application, due to either a
programming bug or an irresponsible web crawler (see
https://www.pgexperts.com/blog/quinn_weaver/)

* issues evicting blocks from shared_buffers: what is your
shared_buffers set to?  How large is your database? (a quick way to check the
size is sketched after this list)

* Checkpoint stalls: what FS are you on?  What are your transaction log
settings for PostgreSQL?

* Issues with the software/hardware stack around your storage, causing
total IO stalls periodically.  What does IO throughput look like
before/during/after the stalls?
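
For the size question above, a quick sketch ("yourdb" is a placeholder; run it
against the database in question):

psql -c "SHOW shared_buffers;"
psql -d yourdb -c "SELECT pg_size_pretty(pg_database_size(current_database()));"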

The last was the cause the last time I dealt with a situation like
yours; it turned out the issue was bad RAID card firmware where the card
would lock up whenever the write-through buffer got too much pressure.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Sudden connection and load average spikes with postgresql 9.3

From
eudald_v
Date:
Dear Josh,
I'm sorry I didn't write before, but we have been very busy with this issue
and, you know, when something goes wrong, the apocalypse comes with it.

I've been working on everything you suggested.

I used your tables and scripts, and I can give you a sample of the output from
locked_query_start:
 2015-07-02 14:49:45.972129+02 |       15314 |          |       4001 |
| "TABLE_Z"     | tuple        | ExclusiveLock | 24018:24      | relation
| ShareUpdateExclusiveLock |              |             | YYY.YYY.YYY.YYY/32
| 2015-07-02 14:49:26.635599+02 | 2015-07-02 14:49:26.635599+02 | 2015-07-02
14:49:26.635601+02 | INSERT INTO "TABLE_X" ("field1", "field2", "field3",
"field4", "field5", "field6", "field7") VALUES (22359509, 92, 5, 88713,
'XXX.XXX.XXX.XXX', 199, 10)  |            |                  | 2015-07-02
14:11:45.368709+02 | 2015-07-02 14:11:45.368709+02 | active       |
2015-07-02 14:11:45.36871+02  | autovacuum: VACUUM ANALYZE public.TABLE_Z

 2015-07-02 14:49:45.972129+02 |       15857 |          |       4001 |
| "TABLE_Z"     | tuple        | ExclusiveLock | 24018:24      | relation
| ShareUpdateExclusiveLock |              |             | YYY.YYY.YYY.YYY/32
| 2015-07-02 14:49:22.79166+02  | 2015-07-02 14:49:22.79166+02  | 2015-07-02
14:49:22.791665+02 | INSERT INTO "TABLE_X" ("field1", "field2", "field3",
"field4", "field5", "field6", "field7") VALUES (14515978, 92, 5, 88713,
'XXX.XXX.XXX.XXX', 199, 10) |            |                  | 2015-07-02
14:11:45.368709+02 | 2015-07-02 14:11:45.368709+02 | active       |
2015-07-02 14:11:45.36871+02  | autovacuum: VACUUM ANALYZE public.TABLE_Z

2015-07-02 14:49:45.972129+02 |       15314 |          |      14712 |
| "TABLE_Z"     | tuple        | ExclusiveLock | 24018:24      | relation
| AccessShareLock          |              |             |
1YYY.YYY.YYY.YYY/32 | 2015-07-02 14:49:26.635599+02 | 2015-07-02
14:49:26.635599+02 | 2015-07-02 14:49:26.635601+02 | INSERT INTO "TABLE_X"
("field1", "field2", "field3", "field4", "field5", "field6", "field") VALUES
(22359509, 92, 5, 88713, 'XXX.XXX.XXX.XXX', 199, 10)  |            |
185.10.253.72/32 | 2015-07-02 14:48:48.841375+02 | 2015-07-02
14:48:48.841375+02 | active       | 2015-07-02 14:48:48.841384+02 | INSERT
INTO "TABLE_Y" ("email_id", "sendout_id", "feed_id", "isp_id")  VALUES
(46015879, 75471, 419, 0)

All of that was recorded during a spike. From this log I have to point
something out: tables TABLE_X and TABLE_Y both have a TRIGGER that does an
INSERT into TABLE_Z. As you can see, TABLE_Z was being VACUUM ANALYZEd at the
time. I wonder whether triggers and VACUUM work well together, just to check
another perspective.

We have also looked carefully at our scripts and performed some code
optimizations (like closing DB connections earlier), but the spikes continue
to happen.

The FS is ext4, and I don't know how I can check the transaction log
configuration.

This is how I/O looks before and after any problem happens:
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             357,00      7468,00      8840,00       7468       8840
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5,02    0,00    2,44    0,06    0,00   92,47
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             471,00      7032,00     13760,00       7032      13760
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5,14    0,00    2,92    0,03    0,00   91,92
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             376,00      7192,00      8048,00       7192       8048
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4,77    0,00    2,57    0,03    0,00   92,63
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             304,00      7280,00      8252,00       7280       8252

And this is how it looks when the spike happens:
http://pastebin.com/2hAYuDZ5

Hope it can help into determining what's happening.

Thanks for all your efforts and collaboration!
Eudald





Re: Sudden connection and load average spikes with postgresql 9.3

From
Josh Berkus
Date:
On 07/02/2015 08:41 AM, eudald_v wrote:
> All that was recorded during a spike. From this log I have to point
> something:
> Tables TABLE_X and TABLE_Y have both a TRIGGER that does an INSERT to
> TABLE_Z
> As you can see, TABLE_Z was being VACUUM ANALYZED. I wonder if TRIGGERS and
> VACUUM work well together, just to check another perspective.

Well, it's not triggers in particular, but vacuum does create some
contention and possible sluggishness.  Questions:

* what kind of writes do the triggers do?
* can they conflict between sessions? that is, are different writes on X
and/or Y possibly overwriting the same rows on Z?
* is that autovacuum a regular autovacuum, or is it "to prevent wraparound"?
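
(For the last question, a quick check, sketched here: autovacuum workers are
visible in pg_stat_activity, and the anti-wraparound case is flagged in the
query text.)

psql -c "SELECT pid, query_start, query FROM pg_stat_activity WHERE query LIKE 'autovacuum:%';"
# an anti-wraparound run ends with "(to prevent wraparound)" in the query text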

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Sudden connection and load average spikes with postgresql 9.3

From
Josh Berkus
Date:
On 07/02/2015 08:41 AM, eudald_v wrote:
> And this is how it looks like when the spike happens:
> http://pastebin.com/2hAYuDZ5

Hmm, those incredibly high system % indicate that there's something
wrong with your system.  If you're not using software RAID or ZFS, you
should never see that.

I think you have a driver, kernel, Linux memory management, or IO stack
issue.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Sudden connection and load average spikes with postgresql 9.3

From
Kevin Grittner
Date:
Josh Berkus <josh@agliodbs.com> wrote:
> On 07/02/2015 08:41 AM, eudald_v wrote:

>> And this is how it looks like when the spike happens:
>> [system CPU staying over 90%]

> I think you have a driver, kernel, Linux memory management, or IO
> stack issue.

In my experience this is usually caused by failure to disable
transparent huge page support.  The larger shared_buffers is
configured, the bigger the problem.
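
A quick way to check (paths as on a stock Ubuntu kernel; adjust for your
distro):

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
# "[always]" as the bracketed value in either file means THP is in play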

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Sudden connection and load average spikes with postgresql 9.3

From
Scott Marlowe
Date:
On Tue, Jun 30, 2015 at 8:52 AM, eudald_v <reaven.galaeindael@gmail.com> wrote:
> Hello all,
> This is my very first message to the Postgresql community, and I really hope
> you can help me solve the trouble I'm facing.
>
> I've an 80 core server (multithread) with close to 500GB RAM.
>
> My configuration is:
> MaxConn: 1500 (was 850)

Drop this to 80 to 160, and use pgbouncer for pooling. pgpool is
nice, but it's a lot harder to configure. pgbouncer takes literally a
couple of minutes and you're up and running.
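
A minimal pgbouncer.ini along these lines is enough to get started (database
name, paths and pool sizes here are placeholders; size default_pool_size
roughly to your core count):

cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1500
default_pool_size = 80
EOF

Note that transaction pooling assumes the app doesn't rely on session state
(prepared statements, temp tables, etc.); if it does, use pool_mode = session.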

> Shared buffers: 188Gb

I have yet to see shared_buffers this big be a big help. I'd drop it
down to 1 to 16G or so but that's just me.

> work_mem: 110Mb (was 220Mb)

110MB * 1500 connections * 1, 2, or 3 sorts per query == disaster. Drop
this down as well. If you have ONE query that needs a lot, create a
user for that query, alter that user for bigger work_mem and then use
it only for that big query.
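
For example (a sketch; the role name is made up):

psql -c "CREATE ROLE big_report LOGIN;"
psql -c "ALTER ROLE big_report SET work_mem = '220MB';"
# only sessions connecting as big_report get the large sort memory;
# everything else keeps the small global default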

> maintenance_work_mem: 256Mb
> effective_cache_size: 340Gb
>
> The database is running under postgresql 9.3.9 on an Ubuntu Server 14.04 LTS
> (build 3.13.0-55-generic)

Good kernel.

OK so you've got 80 cores. Have you checked zone_reclaim_mode? Under
no circumstances should that ever be turned on on a database server.

if "sysctl -a|grep zone" returns:
vm.zone_reclaim_mode = 1

then use /etc/sysctl.conf to set it to 0. It's a silent killer and it
will make your machine run REALLY slow for periods of 10 to 30 minutes
for no apparent reason.
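
i.e. something like:

sysctl -w vm.zone_reclaim_mode=0                     # takes effect immediately
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf  # survives a reboot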

But first and foremost GET A POOLER in the loop NOW. Not having one is
making your machine extremely vulnerable to overload, and making it
impossible for you to manage it.


Re: Sudden connection and load average spikes with postgresql 9.3

From
eudald_v
Date:
Hello guys!

I finally got rid of it.
It looks like, in the end, it was all due to the transparent_hugepage setting.

I disabled it and the CPU spikes disappeared. I'm sorry, because it's something
I usually disable on PostgreSQL servers, but I forgot to do so on this one and
never thought about it.
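
For the archives, in case it helps anyone else, this is roughly what disabling
it looks like (the grub change is the usual way to make it stick across
reboots on Ubuntu):

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# to persist, add transparent_hugepage=never to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub and run update-grub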

Thanks a lot for all your helpful messages!

Eudald





Re: Sudden connection and load average spikes with postgresql 9.3

From
Scott Marlowe
Date:
Note that if you still have the settings you showed in your original
post you're just moving the goal posts a few feet further back. Any
heavy load can still trigger this kind of behaviour.

On Tue, Jul 7, 2015 at 5:29 AM, eudald_v <reaven.galaeindael@gmail.com> wrote:
> Hello guys!
>
> I finally got rid of it.
> It looks that at the end it was all due to transparent_hugepages values.
>
> I disabled them and cpu spikes disappeared. I am sorry cause it's something
> I usually disable on postgresql servers, but I forgot to do so on this one
> and never thought about it.
>
> Thanks a lot for all your helpful messages!
>
> Eudald
>



--
To understand recursion, one must first understand recursion.