Thread: 100% CPU usage on some postmaster processes kills the whole database

100% CPU usage on some postmaster processes kills the whole database

From: Paul Dunkler

Hi List,

we are currently running a rather large PostgreSQL installation with approximately 4,000 transactions and 50,000 index scans per second.

In the last few days, at certain times of day (irregular, 3-4 times a day), some of the postmaster processes run at 100% CPU usage. That leads to a total breakdown of query execution. We see tons of statements that are correctly aborted automatically by our statement_timeout of 15 seconds. I tried to investigate, but cannot really figure out what the problem might be...
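
(The timeout itself is nothing special - just the usual setting, roughly along these lines; the exact value is the only relevant part:)

    # in postgresql.conf:  statement_timeout = 15000    (milliseconds)
    # or equivalently per session:
    psql -c "SET statement_timeout = '15s'; SHOW statement_timeout;"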

Some things I have checked:
- We are not running any bulk jobs or maintenance scripts at these times
- No system errors in any logs during the slowdowns
- I/O performance seems fine, no high I/O wait... But I/O writes drop completely at those times, because it seems no postgres process can get any update done

I just installed a script that dumps the top and ps axf output so I can track down the problem (a rough sketch of it follows the snippet below). Here is a snippet of the top output:

top - 15:55:02 up 59 days, 37 min,  1 user,  load average: 35.95, 14.04, 7.32
Tasks: 2417 total,  54 running, 2363 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.3%us,  1.0%sy,  0.0%ni, 90.2%id,  1.9%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:  264523700k total, 250145228k used, 14378472k free,   207032k buffers
Swap:  2097144k total,   553624k used,  1543520k free, 166905748k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
29852 postgres  20   0  131g  59m  35m R 100.0  0.0   1:27.71 postmaster        
29854 postgres  20   0  131g  70m  45m R 100.0  0.0   1:35.43 postmaster        
17449 postgres  20   0  131g 1.2g 1.2g R 100.0  0.5   1:52.62 postmaster        
29868 postgres  20   0  131g 1.1g 1.0g R 100.0  0.4   1:58.93 postmaster        
30136 postgres  20   0  131g  77m  52m R 100.0  0.0   1:34.33 postmaster        
30294 postgres  20   0  131g  66m  41m R 100.0  0.0   1:33.33 postmaster        
30864 postgres  20   0  131g  66m  41m R 100.0  0.0   1:36.17 postmaster        
30872 postgres  20   0  131g  61m  36m R 100.0  0.0   1:26.81 postmaster        
30876 postgres  20   0  131g  68m  43m R 100.0  0.0   1:33.97 postmaster        
30899 postgres  20   0  131g  68m  44m R 100.0  0.0   1:38.95 postmaster        
30906 postgres  20   0  131g  67m  42m R 100.0  0.0   1:27.82 postmaster        
31173 postgres  20   0  131g  68m  44m R 100.0  0.0   1:28.49 postmaster        
31239 postgres  20   0  131g  71m  46m R 100.0  0.0   1:31.42 postmaster        
31248 postgres  20   0  131g  90m  65m R 100.0  0.0   1:26.20 postmaster        
34934 postgres  20   0  131g 5580 3456 R 100.0  0.0   1:23.96 postmaster        
47945 postgres  20   0  131g 3.0g 3.0g R 100.0  1.2   6:08.41 postmaster        
16116 postgres  20   0  131g  84m  59m R 100.0  0.0   1:30.60 postmaster        
16304 postgres  20   0  131g  85m  60m R 100.0  0.0   1:38.89 postmaster        
17104 postgres  20   0  131g  96m  72m R 100.0  0.0   1:27.54 postmaster        
17111 postgres  20   0  131g  98m  73m R 100.0  0.0   1:38.23 postmaster        
17320 postgres  20   0  131g  98m  74m R 100.0  0.0   1:38.51 postmaster        
31221 postgres  20   0  131g  63m  38m R 100.0  0.0   1:33.63 postmaster        
31272 postgres  20   0  131g 1.0g 1.0g R 100.0  0.4   1:32.71 postmaster        
 3290 postgres  20   0  131g  99m  74m R 100.0  0.0   1:32.76 postmaster        
 3459 postgres  20   0  131g 2.1g 2.0g R 100.0  0.8   1:44.92 postmaster        
16492 postgres  20   0  131g 100m  75m R 100.0  0.0   1:33.36 postmaster        
16562 postgres  20   0  131g 114m  89m R 100.0  0.0   1:35.14 postmaster        
17146 postgres  20   0  131g  91m  66m R 100.0  0.0   1:37.39 postmaster        
17403 postgres  20   0  131g  98m  73m R 100.0  0.0   1:32.13 postmaster        
31100 postgres  20   0  131g  62m  38m R 100.0  0.0   1:29.06 postmaster        
 2019 postgres  20   0  131g 1.2g 1.2g R 98.7  0.5   1:40.91 postmaster         
 2150 postgres  20   0  131g 1.3g 1.3g R 98.7  0.5   2:53.14 postmaster         
16048 postgres  20   0  131g  71m  46m R 98.7  0.0   1:29.75 postmaster         
30190 postgres  20   0  131g 1.4g 1.3g R 98.7  0.5   0:55.98 postmaster         
16112 postgres  20   0  131g 862m 827m R 97.1  0.3   0:48.00 postmaster         
31202 postgres  20   0  131g  74m  49m R 97.1  0.0   1:34.62 postmaster         
35658 postgres  20   0  131g 5948 3788 R 97.1  0.0   0:12.29 postmaster         
16134 postgres  20   0  131g 1.9g 1.9g R 95.4  0.8   1:47.27 postmaster         
31034 postgres  20   0  131g  69m  44m R 95.4  0.0   1:26.35 postmaster         
16120 postgres  20   0  131g 1.2g 1.2g R 93.8  0.5   2:04.02 postmaster         
30891 postgres  20   0  131g  57m  33m R 93.8  0.0   1:23.08 postmaster         
31261 postgres  20   0  131g  81m  56m R 93.8  0.0   1:24.51 postmaster         
29790 postgres  20   0  131g  62m  37m R 92.2  0.0   1:35.34 postmaster         
30426 postgres  20   0  131g  62m  37m R 87.4  0.0   1:34.51 postmaster         
30857 postgres  20   0  131g  50m  26m R 79.3  0.0   1:37.82 postmaster         
  507 root      39  19     0    0    0 R 67.9  0.0  19:19.71 khugepaged         
16095 postgres  20   0  131g  83m  58m R 67.9  0.0   1:27.64 postmaster         
30856 postgres  20   0  131g  69m  44m R 67.9  0.0   1:34.46 postmaster         
17442 postgres  20   0  131g 2.4g 2.4g S 11.3  0.9   1:02.14 postmaster         
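
The capture script itself is nothing fancy - roughly something like this (the path and interval are just placeholders):

    #!/bin/sh
    # rough sketch: snapshot top and ps output once a minute so busy PIDs can
    # later be matched against what was running at the time
    while true; do
        {
            date
            top -b -n 1 | head -80
            ps axf
        } >> /var/log/pg-cpu-snapshots.log
        sleep 60
    done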


PostgreSQL version information:
- PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6), 64-bit
- Running hot-standby replication to another node (same hardware setup there)

Server hardware:
- 4× 12-core AMD Magny-Cours (48 cores total)
- 256 GB of RAM (36% currently used)
- 1.3 TB SAS RAID (LSI RAID controller), 15k rpm disks

If I have left out any important information you need to analyze the problem, please let me know. I did my best to make the question as accurate as possible.

--
Kind regards

Paul Dunkler

Re: 100% CPU usage on some postmaster processes kills the whole database

From: Richard Huxton

On 01/03/12 16:41, Paul Dunkler wrote:
> Hi List,
>
> we are currently running a rather large PostgreSQL installation with approximately 4,000 transactions and 50,000 index scans per second.
>
> In the last few days, at certain times of day (irregular, 3-4 times a day), some of the postmaster processes run at 100% CPU usage. That leads to a total breakdown of query execution. We see tons of statements that are correctly aborted automatically by our statement_timeout of 15 seconds. I tried to investigate, but cannot really figure out what the problem might be...
>
> Some things I have checked:
> - We are not running any bulk jobs or maintenance scripts at these times
> - No system errors in any logs during the slowdowns
> - I/O performance seems fine, no high I/O wait... But I/O writes drop completely at those times, because it seems no postgres process can get any update done
>
> I just installed a script that dumps the top and ps axf output so I can track down the problem. Here is a snippet of the top output:

Combine that with this:
   SELECT * FROM pg_stat_activity;

That will let you line up pids from top with active queries.
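
For example, something along these lines, run next to top (9.1 column names; the PIDs are just the ones from your snippet - substitute whatever is pegged at the time):

    psql -c "
    SELECT procpid, waiting, xact_start, current_query
      FROM pg_stat_activity
     WHERE procpid IN (29852, 29854, 17449);"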

--
   Richard Huxton
   Archonet Ltd

Re: 100% CPU usage on some postmaster processes kills the whole database

From: Paul Dunkler

I did that now and analyzed the situation a bit. The only queries running are ones that normally execute very fast even under high load (only index scans, very low rates of sequential scans). I also found a remarkable number of INSERT statements...

And sometimes when that happens, CPU utilization goes up to nearly 100% as well, and 98% of it is system time...
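
A rough way to see that breakdown (just a sketch, counting active statements by their first keyword) is something like:

    psql -c "
    SELECT upper(split_part(current_query, ' ', 1)) AS kind, count(*)
      FROM pg_stat_activity
     WHERE current_query NOT LIKE '<IDLE>%'
     GROUP BY 1
     ORDER BY 2 DESC;"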

On 01.03.2012, at 18:02, Richard Huxton wrote:

> On 01/03/12 16:41, Paul Dunkler wrote:
>> [...]
>
> Combine that with this:
>    SELECT * FROM pg_stat_activity;
>
> That will let you line up pids from top with active queries.


--
Kind regards

Paul Dunkler





-----------------------------------------------------
XYRALITY GmbH • Lerchenstraße 28a • 22767 Hamburg
Paul Dunkler • Software Developer
Mail: paul.dunkler@xyrality.com
Tel: +49 (0) 40 23 51 78 97
Mobile: +49 (0) 151 252 228 42
Fax: +49 (0) 40 23 51 78 98
Web: http://www.xyrality.com/
Commercial register: Hamburg HRB 115332
Managing directors: Sven Ossenbrüggen & Alexander Spohr
-----------------------------------------------------


Re: 100% CPU usage on some postmaster processes kills the whole database

From: Richard Huxton

On 01/03/12 19:41, Paul Dunkler wrote:
> I did that now and analyzed the situation a bit. The only queries running
> are ones that normally execute very fast even under high load (only index
> scans, very low rates of sequential scans). I also found a remarkable
> number of INSERT statements...
>
> And sometimes when that happens, CPU utilization goes up to nearly 100% as
> well, and 98% of it is system time...

You're running on a box larger than I'm used to, so this is only
speculation. I'm wondering whether you're hitting problems with lock
contention or some such. It looks like you've got 48 cores there all at
about 100%, with possibly none of them getting much chance to do any real work.
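
If it is lock contention, one quick thing to check while it's happening (again, only a sketch) is whether backends are piling up behind ungranted locks:

    # pg_locks.pid joins to pg_stat_activity.procpid on 9.1
    psql -c "
    SELECT l.pid, l.locktype, l.mode, a.current_query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.procpid = l.pid
     WHERE NOT l.granted;"

That only shows heavyweight locks, of course, not internal spinlock contention.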

Oddly, the totals you posted in your top output show 6.3% user CPU
usage, which I can't square with 50-odd processes all approaching
100% CPU.

Perhaps have a look at vmstat output too - see if context-switches spike
unusually high during these periods (sorry - no idea what an unusually
high number would be on a machine like yours).

Reducing the number of concurrent backends might help, but that rather
depends on whether my guess is right.

If no-one more experienced than me comes along shortly, try reposting to
the performance list. There are people there who are used to machines of
this size.

--
   Richard Huxton
   Archonet Ltd

Re: 100% CPU usage on some postmaster processes kills the whole database

From: Paul Dunkler

Hi,

> You're running on a box larger than I'm used to, so this is only speculation. I'm wondering whether you're hitting problems with lock contention or some such. It looks like you've got 48 cores there all at about 100%, with possibly none of them getting much chance to do any real work.

Yes. That is what I see too...

> Oddly, the totals you posted in your top output show 6.3% user CPU usage, which I can't square with 50-odd processes all approaching 100% CPU.

Sometimes the CPU is only around 7% used during these episodes, but at other peak times the CPU is fully used (97% system), as I posted before.

> Perhaps have a look at vmstat output too - see if context switches spike unusually high during these periods (sorry - no idea what an unusually high number would be on a machine like yours).

Thanks, I will have a look at it.

> Reducing the number of concurrent backends might help, but that rather depends on whether my guess is right.

Yes... I have already thought about setting up a connection pool.

> If no-one more experienced than me comes along shortly, try reposting to the performance list. There are people there who are used to machines of this size.

Thanks. I will wait a while and consider re-posting it to the performance list.


--
Kind regards

Paul Dunkler







Re: 100% CPU usage on some postmaster processes kills the whole database

From: Scott Marlowe

On Thu, Mar 1, 2012 at 1:52 PM, Paul Dunkler <paul.dunkler@xyrality.com> wrote:
> [...]

I'd look at vmstat and iostat output (vmstat 10, iostat -xd 10) for a
few minutes. In vmstat, look for high (>100k) in (interrupts) or cs
(context switch) numbers; in iostat, look at I/O utilization.
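
Roughly like this - what counts as "unusually high" depends on your box:

    # run side by side for a few minutes while the problem is happening
    vmstat 10          # watch the "in" (interrupts) and "cs" (context switches) columns
    iostat -xd 10      # watch %util and await per device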