Thread: Multi processor server overloads occasionally with system process while running postgresql-9.4

I work at a public company that uses only open source applications and databases. I have a problem with our critical database, which is both write and read intensive.

Version: PostgreSQL 9.4
Hardware: HP DL980 (8 processors, 80 cores without hyper-threading, 512GB RAM)
Operating system: Red Hat Enterprise Linux Server release 6.4 (Santiago)
uname -a: Linux host1 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

Single database with separate tablespaces for main data, pg_xlog and indexes.

The database is 770GB and is expected to grow to 2TB within the next year. It used to run on a 2-processor HP DL560 (16 cores); as the transaction volume kept increasing, we moved it to the DL980 with 8 processors and 512GB RAM.

Problem: At times during moderate load, CPU usage goes up to 400% and users cannot complete their queries in the expected time. But the load comes from system processes only. Connections normally average about 50, but when this happens they shoot up to max_connections.

sar command output

07:20:01 IST   CPU   %user   %nice   %system   %iowait   %steal   %idle
07:30:01 IST   all    0.73    0.00      0.37      0.58     0.00   98.33
07:40:01 IST   all    0.66    0.00      0.38      0.65     0.00   98.31
07:50:01 IST   all    0.27    0.00      0.27      0.01     0.00   99.45
08:00:01 IST   all    0.52    0.00      0.37      0.01     0.00   99.10
08:10:01 IST   all    1.54    0.00      0.70      0.02     0.00   97.74
08:20:01 IST   all    1.20    0.00      0.67      0.02     0.00   98.10
08:30:01 IST   all    1.48    0.00      0.77      0.03     0.00   97.72
08:40:01 IST   all    1.69    0.00      0.89      0.04     0.00   97.39
08:50:01 IST   all    1.71    0.00      0.94      0.04     0.00   97.31
09:00:01 IST   all    1.74    0.00      0.92      0.03     0.00   97.31
09:10:01 IST   all    2.32    0.00      1.06      0.04     0.00   96.58
09:20:01 IST   all    2.22    0.00      1.17      0.04     0.00   96.57
09:30:02 IST   all    2.20    0.00      6.68      0.06     0.00   91.06
09:40:01 IST   all    2.43    0.00      1.37      0.06     0.00   96.14
09:50:01 IST   all    3.23    0.00      2.06      0.08     0.00   94.63
10:00:02 IST   all    3.15    0.00      6.10      0.07     0.00   90.67
10:10:01 IST   all    4.94    0.00      5.20      0.29     0.00   89.57
10:20:01 IST   all    5.10    0.00      2.13      0.34     0.00   92.43
10:30:01 IST   all    5.60    0.00      2.42      0.18     0.00   91.80
10:40:01 IST   all    5.28    0.00     14.37      0.19     0.00   80.16
10:50:01 IST   all    4.52    0.00     28.48      0.23     0.00   66.77
11:00:01 IST   all    5.25    0.00      9.02      0.18     0.00   85.55
11:10:01 IST   all    5.77    0.00      4.96      0.27     0.00   89.00
11:20:01 IST   all    5.70    0.00      2.74      0.19     0.00   91.37
11:30:01 IST   all    5.72    0.00      5.91      0.20     0.00   88.17
11:40:01 IST   all    5.66    0.00      2.81      0.37     0.00   91.15
11:50:01 IST   all    5.90    0.00      8.80      0.10     0.00   85.19
12:00:01 IST   all    6.44    0.00      3.40      0.13     0.00   90.03
12:10:01 IST   all    7.18    0.00      4.52      0.11     0.00   88.18
12:20:02 IST   all    4.40    0.00     37.84      0.07     0.00   57.70
12:30:01 IST   all    5.66    0.00      2.98      0.10     0.00   91.26
12:40:01 IST   all    5.74    0.00      3.05      0.11     0.00   91.10
Average:       all    1.92    0.00      2.28      0.11     0.00   95.69

postgresql.conf
max_connections = 500 (can be reduced)
shared_buffers = 8500MB
work_mem = 50MB
maintenance_work_mem = 8064MB
checkpoint_segments = 132
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9

This overload happens 5-6 times a day. How can I trace the cause of this problem?

My thoughts:
1. Something related to the NUMA system's memory management.
2. Something related to the size of shared buffers.

Please help.

Ajayakumar.BS

View this message in context: Multi processor server overloads occationally with system process while running postgresql-9.4
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
On 03/10/15 21:39, ajaykbs wrote:
> [...]
A little bit of formatting might make the above a bit more readable...
One paragraph is hard to parse.


-Gavin


Are you using any connection pooler in front of the database?

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
On 2015-10-03 01:39:33 -0700, ajaykbs wrote:
> It is observed that at some times during moderate load
> the CPU usage goes up to 400% and the users are not able to complete the
> queries in expected time. But the load is contributed by some system process
> only. The average connections are normally 50.

This email is nearly impossible to read.

But it sounds a bit like you need to disable transparent hugepages
and/or zone_reclaim mode.

Greetings,

Andres Freund
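
The two kernel settings named above can be inspected with a small script. This is a minimal sketch, not something posted in the thread: it assumes a Linux host, and the helper names (`active_thp_setting`, `read_first_existing`) are made up for illustration. RHEL 6 kernels expose transparent hugepages under a `redhat_`-prefixed sysfs directory, so both locations are tried.

```python
from pathlib import Path

# Candidate sysfs locations for the THP switch; RHEL 6 kernels use the
# "redhat_" prefixed directory, mainline kernels use the plain one.
THP_PATHS = (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
)
ZONE_RECLAIM_PATH = "/proc/sys/vm/zone_reclaim_mode"


def active_thp_setting(text):
    """Return the bracketed (active) value from e.g. 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_first_existing(paths):
    """Return the stripped contents of the first path that exists, else None."""
    for candidate in paths:
        path = Path(candidate)
        if path.exists():
            return path.read_text().strip()
    return None


if __name__ == "__main__":
    thp_raw = read_first_existing(THP_PATHS)
    print("THP:", active_thp_setting(thp_raw) if thp_raw else "not found")
    # zone_reclaim_mode is a bitmask; 0 means zone reclaim is off.
    zone_reclaim = read_first_existing((ZONE_RECLAIM_PATH,))
    print("zone_reclaim_mode:", zone_reclaim if zone_reclaim is not None else "not found")
```

A THP value of `never` and a zone_reclaim_mode of `0` would correspond to both features being disabled.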


Sorry about the formatting.
I am posting the same lines again.

I work at a public company that uses only open source applications and
databases. I have a problem with our critical database, which is both write
and read intensive.

Version: PostgreSQL 9.4
Hardware: HP DL980 (8 processors, 80 cores without hyper-threading, 512GB RAM)
Operating system: Red Hat Enterprise Linux Server release 6.4 (Santiago)
uname -a: Linux host1 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

Single database with separate tablespaces for main data, pg_xlog and indexes.

The database is 770GB and is expected to grow to 2TB within the next year.
It used to run on a 2-processor HP DL560 (16 cores); as the transaction
volume kept increasing, we moved it to the DL980 with 8 processors and
512GB RAM.

Problem: At times during moderate load, CPU usage goes up to 400% and users
cannot complete their queries in the expected time. But the load comes from
system processes only. Connections normally average about 50, but when this
happens they shoot up to max_connections.

sar command output

07:20:01 IST   CPU   %user   %nice   %system   %iowait   %steal   %idle
07:30:01 IST   all    0.73    0.00      0.37      0.58     0.00   98.33
07:40:01 IST   all    0.66    0.00      0.38      0.65     0.00   98.31
07:50:01 IST   all    0.27    0.00      0.27      0.01     0.00   99.45
08:00:01 IST   all    0.52    0.00      0.37      0.01     0.00   99.10
08:10:01 IST   all    1.54    0.00      0.70      0.02     0.00   97.74
08:20:01 IST   all    1.20    0.00      0.67      0.02     0.00   98.10
08:30:01 IST   all    1.48    0.00      0.77      0.03     0.00   97.72
08:40:01 IST   all    1.69    0.00      0.89      0.04     0.00   97.39
08:50:01 IST   all    1.71    0.00      0.94      0.04     0.00   97.31
09:00:01 IST   all    1.74    0.00      0.92      0.03     0.00   97.31
09:10:01 IST   all    2.32    0.00      1.06      0.04     0.00   96.58
09:20:01 IST   all    2.22    0.00      1.17      0.04     0.00   96.57
09:30:02 IST   all    2.20    0.00      6.68      0.06     0.00   91.06
09:40:01 IST   all    2.43    0.00      1.37      0.06     0.00   96.14
09:50:01 IST   all    3.23    0.00      2.06      0.08     0.00   94.63
10:00:02 IST   all    3.15    0.00      6.10      0.07     0.00   90.67
10:10:01 IST   all    4.94    0.00      5.20      0.29     0.00   89.57
10:20:01 IST   all    5.10    0.00      2.13      0.34     0.00   92.43
10:30:01 IST   all    5.60    0.00      2.42      0.18     0.00   91.80
10:40:01 IST   all    5.28    0.00     14.37      0.19     0.00   80.16
10:50:01 IST   all    4.52    0.00     28.48      0.23     0.00   66.77
11:00:01 IST   all    5.25    0.00      9.02      0.18     0.00   85.55
11:10:01 IST   all    5.77    0.00      4.96      0.27     0.00   89.00
11:20:01 IST   all    5.70    0.00      2.74      0.19     0.00   91.37
11:30:01 IST   all    5.72    0.00      5.91      0.20     0.00   88.17
11:40:01 IST   all    5.66    0.00      2.81      0.37     0.00   91.15
11:50:01 IST   all    5.90    0.00      8.80      0.10     0.00   85.19
12:00:01 IST   all    6.44    0.00      3.40      0.13     0.00   90.03
12:10:01 IST   all    7.18    0.00      4.52      0.11     0.00   88.18
12:20:02 IST   all    4.40    0.00     37.84      0.07     0.00   57.70
12:30:01 IST   all    5.66    0.00      2.98      0.10     0.00   91.26
12:40:01 IST   all    5.74    0.00      3.05      0.11     0.00   91.10
Average:       all    1.92    0.00      2.28      0.11     0.00   95.69
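
To line the overload episodes up against other logs, the intervals where %system jumps can be pulled out of the sar text mechanically. The following is a rough sketch rather than anything from the original post: the function name `system_spikes` and the 10% threshold are arbitrary assumptions, and the parser expects the nine-column layout shown above (time, timezone, CPU, then six percentages). The sample rows are copied from the output above.

```python
# A few rows copied from the sar output in this message.
SAR_SAMPLE = """\
07:20:01 IST CPU %user %nice %system %iowait %steal %idle
10:30:01 IST all 5.60 0.00 2.42 0.18 0.00 91.80
10:40:01 IST all 5.28 0.00 14.37 0.19 0.00 80.16
10:50:01 IST all 4.52 0.00 28.48 0.23 0.00 66.77
11:00:01 IST all 5.25 0.00 9.02 0.18 0.00 85.55
12:20:02 IST all 4.40 0.00 37.84 0.07 0.00 57.70
Average: all 1.92 0.00 2.28 0.11 0.00 95.69
"""


def system_spikes(sar_text, threshold=10.0):
    """Return (time, %user, %system) for rows whose %system >= threshold.

    The threshold is an arbitrary assumption, not a recommended value.
    """
    spikes = []
    for line in sar_text.splitlines():
        fields = line.split()
        # Data rows have 9 fields with "all" in the CPU column; the header
        # row ("CPU") and the 8-field "Average:" row are skipped.
        if len(fields) != 9 or fields[2] != "all":
            continue
        try:
            user = float(fields[3])
            system = float(fields[5])
        except ValueError:
            continue
        if system >= threshold:
            spikes.append((fields[0], user, system))
    return spikes


if __name__ == "__main__":
    for time, user, system in system_spikes(SAR_SAMPLE):
        print(f"{time}  %user={user:.2f}  %system={system:.2f}")
```

In the flagged rows %user stays around 5% while %system climbs, which matches the description that the load is on the system side. A shorter sar sampling interval than 10 minutes would narrow down when each episode starts.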


postgresql.conf
max_connections = 500 (can be reduced)
shared_buffers = 8500MB
work_mem = 50MB
maintenance_work_mem = 8064MB
checkpoint_segments = 132
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9

I am not using a connection pooler.

This overload happens 5-6 times a day. How can I trace the cause of this
problem?

My thoughts:
1. Something related to the NUMA system's memory management.
2. Something related to the size of shared buffers.

Please help.



Ajayakumar.BS





I have checked transparent huge pages and zone reclaim mode, and both are
already disabled.

As a trial-and-error measure, I have reduced shared_buffers from
8500MB to 3000MB.
CPU I/O wait has increased a little, but the periodic overload has not
occurred since (3 days have passed without such a situation). I shall report
further developments.
  Thank you all for the great help.





On Tue, Oct 6, 2015 at 11:08 PM, ajaykbs <ajayakumarbs@gmail.com> wrote:
> I have checked the transparent huge pages and zone reclaim mode and those are
> already disabled.
>
> As a trial and error method, I have reduced the shared buffer size from
> 8500MB to 3000MB.
> The CPU i/o wait is increased a little. But the periodical overload has not
> occurred afterwards. (3 days passed without such situation). I shall report
> further developments.

Reduce max_connections to something more reasonable, like < 100, and put
a connection pooler in place (pgbouncer is simple to set up and use).
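
Before picking a lower max_connections or a pool size, it can help to see how many of the open connections are actually doing work during an episode. The sketch below is only an illustration under stated assumptions: it assumes the psycopg2 driver is installed and that the caller supplies a working DSN, and the helper names `summarize` and `sample` are invented here. The `state` column of pg_stat_activity is available in PostgreSQL 9.2 and later, so it applies to 9.4.

```python
from collections import Counter

# pg_stat_activity exposes a "state" column in PostgreSQL 9.2 and later.
STATE_QUERY = "SELECT state, count(*) FROM pg_stat_activity GROUP BY state"


def summarize(rows, max_connections):
    """Summarize (state, count) rows against the configured connection limit."""
    counts = Counter()
    for state, n in rows:
        # state can be NULL, e.g. when the role may not see another session.
        counts[state or "unknown"] += n
    total = sum(counts.values())
    return {
        "total": total,
        "active": counts.get("active", 0),
        "idle": counts.get("idle", 0),
        "headroom": max_connections - total,
    }


def sample(dsn, max_connections=500):
    """Take one sample from a live server (assumes psycopg2 is installed)."""
    import psycopg2  # imported lazily so summarize() works without the driver

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(STATE_QUERY)
            return summarize(cur.fetchall(), max_connections)
    finally:
        conn.close()
```

Calling `sample("dbname=postgres")` (a hypothetical DSN) every few seconds during a spike would show whether the jump toward max_connections consists mostly of active backends or of sessions piling up behind something else.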


On Saturday, October 3, 2015 4:36 AM, ajaykbs <ajayakumarbs@gmail.com> wrote:

> version: Postgresql-9.4
> Hardware: HP DL980 (8-processor, 80 cores w/o hyper threading, 512GB RAM)
> Operating system: Red Hat Enterprise Linux Server release 6.4 (Santiago)
> uname -a : Linux host1 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST
> 2013 x86_64 x86_64 x86_64 GNU/Linux Single database with separate tablespace
> for main-data, pg_xlog and indexes
>
> I have a database having 770GB size and expected to grow to 2TB within the
> next year. The database was running in a 2processor HP DL560 (16 cores) and
> as the transactions of the database were found increasing, we have changed
> the hardware to DL980 with 8 processors and 512GB RAM.
>
> Problem It is observed that at some times during moderate load the CPU
> usage goes up to 400% and the users are not able to complete the queries in
> expected time. But the load is contributed by some system process only. The
> average connections are normally 50. But when this happens the connections
> will shoot up to max-connections.

You might find this thread interesting:

http://www.postgresql.org/message-id/flat/55783940.8080302@wi3ck.info#55783940.8080302@wi3ck.info

The short version is that in existing production versions you can
easily run into such symptoms when you get to 8 or more CPU
packages.  The problem seems to be solved in the development
versions of 9.5 (with changes not suitable for back-patching to a
stable branch).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
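
Whether a given host falls in the "8 or more CPU packages" range can be read from sysfs. The following is a minimal sketch assuming a Linux system; the function name `count_cpu_packages` is made up for illustration, and it simply counts the distinct `physical_package_id` values the kernel reports for each logical CPU.

```python
from pathlib import Path


def count_cpu_packages(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Count distinct physical CPU packages (sockets) listed in sysfs."""
    packages = set()
    # cpu[0-9]* matches cpu0, cpu1, ... but not cpufreq or cpuidle.
    pattern = "cpu[0-9]*/topology/physical_package_id"
    for id_file in Path(sysfs_cpu_dir).glob(pattern):
        packages.add(id_file.read_text().strip())
    return len(packages)


if __name__ == "__main__":
    print("CPU packages:", count_cpu_packages())
```

On the 8-processor DL980 described in this thread the expected result would be 8, which puts it in the range the linked thread discusses.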