Thread: 100% CPU usage on some postmaster processes kills the complete database
Hi List,
We are currently running a rather large PostgreSQL installation with approximately 4k transactions and 50k index scans per second.
Over the last few days, at certain times of day (irregular - 3-4 times a day), some of the postmaster processes run at 100% CPU usage. That leads to a total breakdown of query execution. We see tons of statements that are correctly aborted automatically by our statement_timeout of 15 seconds. I have tried to investigate, but cannot really figure out what the problem could be...
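(For reference, the 15-second limit corresponds to a statement_timeout setting along the following lines. How exactly it is set doesn't matter for the problem itself, and the database name below is only a placeholder:)

-- show the value in effect for the current session
SHOW statement_timeout;
-- one common way to apply it is statement_timeout = '15s' in postgresql.conf,
-- or per database ("mydb" is a placeholder name):
ALTER DATABASE mydb SET statement_timeout = '15s';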
Some things I have checked:
- We are not running any bulk jobs or maintenance scripts at those times
- No system errors in any logs during the slowdowns
- I/O performance seems fine. There is no high I/O wait... but I/O writes drop off completely at those times, because it seems no postgres process can perform any updates
I just installed a script that prints out the top and ps axf information to help pin down the problem. I will post a snippet of the top output here:
top - 15:55:02 up 59 days, 37 min, 1 user, load average: 35.95, 14.04, 7.32
Tasks: 2417 total, 54 running, 2363 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.3%us, 1.0%sy, 0.0%ni, 90.2%id, 1.9%wa, 0.0%hi, 0.6%si, 0.0%st
Mem: 264523700k total, 250145228k used, 14378472k free, 207032k buffers
Swap: 2097144k total, 553624k used, 1543520k free, 166905748k cached

  PID USER     PR  NI VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
29852 postgres 20   0 131g  59m  35m R 100.0  0.0  1:27.71 postmaster
29854 postgres 20   0 131g  70m  45m R 100.0  0.0  1:35.43 postmaster
17449 postgres 20   0 131g 1.2g 1.2g R 100.0  0.5  1:52.62 postmaster
29868 postgres 20   0 131g 1.1g 1.0g R 100.0  0.4  1:58.93 postmaster
30136 postgres 20   0 131g  77m  52m R 100.0  0.0  1:34.33 postmaster
30294 postgres 20   0 131g  66m  41m R 100.0  0.0  1:33.33 postmaster
30864 postgres 20   0 131g  66m  41m R 100.0  0.0  1:36.17 postmaster
30872 postgres 20   0 131g  61m  36m R 100.0  0.0  1:26.81 postmaster
30876 postgres 20   0 131g  68m  43m R 100.0  0.0  1:33.97 postmaster
30899 postgres 20   0 131g  68m  44m R 100.0  0.0  1:38.95 postmaster
30906 postgres 20   0 131g  67m  42m R 100.0  0.0  1:27.82 postmaster
31173 postgres 20   0 131g  68m  44m R 100.0  0.0  1:28.49 postmaster
31239 postgres 20   0 131g  71m  46m R 100.0  0.0  1:31.42 postmaster
31248 postgres 20   0 131g  90m  65m R 100.0  0.0  1:26.20 postmaster
34934 postgres 20   0 131g 5580 3456 R 100.0  0.0  1:23.96 postmaster
47945 postgres 20   0 131g 3.0g 3.0g R 100.0  1.2  6:08.41 postmaster
16116 postgres 20   0 131g  84m  59m R 100.0  0.0  1:30.60 postmaster
16304 postgres 20   0 131g  85m  60m R 100.0  0.0  1:38.89 postmaster
17104 postgres 20   0 131g  96m  72m R 100.0  0.0  1:27.54 postmaster
17111 postgres 20   0 131g  98m  73m R 100.0  0.0  1:38.23 postmaster
17320 postgres 20   0 131g  98m  74m R 100.0  0.0  1:38.51 postmaster
31221 postgres 20   0 131g  63m  38m R 100.0  0.0  1:33.63 postmaster
31272 postgres 20   0 131g 1.0g 1.0g R 100.0  0.4  1:32.71 postmaster
 3290 postgres 20   0 131g  99m  74m R 100.0  0.0  1:32.76 postmaster
 3459 postgres 20   0 131g 2.1g 2.0g R 100.0  0.8  1:44.92 postmaster
16492 postgres 20   0 131g 100m  75m R 100.0  0.0  1:33.36 postmaster
16562 postgres 20   0 131g 114m  89m R 100.0  0.0  1:35.14 postmaster
17146 postgres 20   0 131g  91m  66m R 100.0  0.0  1:37.39 postmaster
17403 postgres 20   0 131g  98m  73m R 100.0  0.0  1:32.13 postmaster
31100 postgres 20   0 131g  62m  38m R 100.0  0.0  1:29.06 postmaster
 2019 postgres 20   0 131g 1.2g 1.2g R  98.7  0.5  1:40.91 postmaster
 2150 postgres 20   0 131g 1.3g 1.3g R  98.7  0.5  2:53.14 postmaster
16048 postgres 20   0 131g  71m  46m R  98.7  0.0  1:29.75 postmaster
30190 postgres 20   0 131g 1.4g 1.3g R  98.7  0.5  0:55.98 postmaster
16112 postgres 20   0 131g 862m 827m R  97.1  0.3  0:48.00 postmaster
31202 postgres 20   0 131g  74m  49m R  97.1  0.0  1:34.62 postmaster
35658 postgres 20   0 131g 5948 3788 R  97.1  0.0  0:12.29 postmaster
16134 postgres 20   0 131g 1.9g 1.9g R  95.4  0.8  1:47.27 postmaster
31034 postgres 20   0 131g  69m  44m R  95.4  0.0  1:26.35 postmaster
16120 postgres 20   0 131g 1.2g 1.2g R  93.8  0.5  2:04.02 postmaster
30891 postgres 20   0 131g  57m  33m R  93.8  0.0  1:23.08 postmaster
31261 postgres 20   0 131g  81m  56m R  93.8  0.0  1:24.51 postmaster
29790 postgres 20   0 131g  62m  37m R  92.2  0.0  1:35.34 postmaster
30426 postgres 20   0 131g  62m  37m R  87.4  0.0  1:34.51 postmaster
30857 postgres 20   0 131g  50m  26m R  79.3  0.0  1:37.82 postmaster
  507 root     39  19    0    0    0 R  67.9  0.0 19:19.71 khugepaged
16095 postgres 20   0 131g  83m  58m R  67.9  0.0  1:27.64 postmaster
30856 postgres 20   0 131g  69m  44m R  67.9  0.0  1:34.46 postmaster
17442 postgres 20   0 131g 2.4g 2.4g S  11.3  0.9  1:02.14 postmaster
PostgreSQL version information:
- PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6), 64-bit
- Running hot-standby replication to another node (same hardware setup there)
Server Hardware:
- 4x 12-core AMD Magny-Cours
- 256 GB of RAM (36% currently used)
- 1.3 TB SAS RAID (LSI RAID controller) - 15k rpm
If I have left out any important information you need to analyze my problem, please let me know. I did my best to describe the problem as accurately as I could.
--
Kind regards
Paul Dunkler
Re: 100% CPU usage on some postmaster processes kills the complete database
From: Richard Huxton
Date: 01/03/2012 18:02
On 01/03/12 16:41, Paul Dunkler wrote:
> [original problem description snipped]

Combine that with this:

SELECT * FROM pg_stat_activity;

That will let you line up pids from top with active queries.

--
Richard Huxton
Archonet Ltd
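For example, something along these lines lines up a few of the busy PIDs from the top snippet with their queries (column names as in 9.1; the PID list below is just illustrative):

SELECT procpid, usename, waiting,
       now() - query_start AS runtime,
       current_query
FROM pg_stat_activity
WHERE procpid IN (29852, 17449, 47945)   -- a few of the 100%-CPU PIDs from top
ORDER BY runtime DESC;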
I did that now and analyzed the situation a bit. The only queries running are ones that normally execute very fast even under high load (only index scans, very low rates of sequential scans). I found a remarkable number of INSERT statements...
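(For anyone who wants to check the same thing: the per-table scan and insert counters can be read from the statistics views roughly like this, sampled a couple of times to see the rates:)

SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd
FROM pg_stat_user_tables
ORDER BY n_tup_ins DESC
LIMIT 20;   -- tables receiving the most inserts; compare seq_scan vs idx_scan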
And sometimes when that happens, the CPU utilization goes up to nearly 100% too, and 98% of that is system usage...
On 01.03.2012 at 18:02, Richard Huxton wrote:
> [original post quoted again - snipped]
>
> Combine that with this:
>
> SELECT * FROM pg_stat_activity;
>
> That will let you line up pids from top with active queries.
>
> --
> Richard Huxton
> Archonet Ltd
--
Kind regards
Paul Dunkler
-----------------------------------------------------
XYRALITY GmbH • Lerchenstraße 28a • 22767 Hamburg
Paul Dunkler • Software Developer
Mail: paul.dunkler@xyrality.com
Tel: +49 (0) 40 23 51 78 97
Mobile: +49 (0) 151 252 228 42
Fax: +49 (0) 40 23 51 78 98
Web: http://www.xyrality.com/
Commercial register: Hamburg HRB 115332
Managing directors: Sven Ossenbrüggen & Alexander Spohr
-----------------------------------------------------
Re: 100% CPU usage on some postmaster processes kills the complete database
From: Richard Huxton
On 01/03/12 19:41, Paul Dunkler wrote:
> I did that now and analyzed the situation a bit. The only queries running are ones that normally execute very fast even under high load (only index scans, very low rates of sequential scans). I found a remarkable number of INSERT statements...
>
> And sometimes when that happens, the CPU utilization goes up to nearly 100% too, and 98% of that is system usage...

You're running on a box larger than I'm used to, so this is only speculation. I'm wondering whether you're hitting problems with lock contention or some such. It looks like you've got 48 cores there, all at about 100%, possibly none of them getting much chance to do any work.

Oddly, the totals you posted in your top output show 6.3% user cpu usage, which I can't make match with 50-odd processes all approaching 100% cpu.

Perhaps have a look at vmstat output too - see if context-switches spike unusually high during these periods (sorry - no idea what an unusually high number would be on a machine like yours).

Reducing the number of concurrent backends might help, but that rather depends on whether my guess is right.

If no-one more experienced than me comes along shortly, try reposting to the performance list. There are people there who are used to machines of this size.

--
Richard Huxton
Archonet Ltd
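If it really is lock contention, a rough way to check for ordinary (heavyweight) lock waits on 9.1 is a query along these lines - note it will not show spinlock/LWLock contention, which could also produce this picture:

SELECT l.pid, l.locktype, l.relation::regclass AS relation, l.mode, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;   -- backends currently waiting for a lock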
Hi,
> You're running on a box larger than I'm used to, so this is only speculation. I'm wondering whether you're hitting problems with lock contention or some such. It looks like you've got 48 cores there, all at about 100%, possibly none of them getting much chance to do any work.

Yes. That is what I see too...

> Oddly, the totals you posted in your top output show 6.3% user cpu usage, which I can't make match with 50-odd processes all approaching 100% cpu.

Sometimes the CPU is only 7% used during these periods, but at other peak times the CPU is at 100% (97% system load), as I posted before.

> Perhaps have a look at vmstat output too - see if context-switches spike unusually high during these periods (sorry - no idea what an unusually high number would be on a machine like yours).

Thanks, I will have a look at it.

> Reducing the number of concurrent backends might help, but that rather depends on whether my guess is right.

Yes... I have already thought about setting up a connection pool.
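(As a sketch of what that could look like - one common option is PgBouncer in front of the database; all names and numbers below are placeholders, not our actual configuration:)

; minimal pgbouncer.ini sketch
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction     ; reuse server connections per transaction
default_pool_size = 50      ; far fewer backends than client connections
max_client_conn = 2000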
> If no-one more experienced than me comes along shortly, try reposting to the performance list. There are people there who are used to machines of this size.

Thanks. I will wait a while and consider re-posting it to the performance list.
> --
> Richard Huxton
> Archonet Ltd
--
Kind regards
Paul Dunkler
On Thu, Mar 1, 2012 at 1:52 PM, Paul Dunkler <paul.dunkler@xyrality.com> wrote:
> [previous exchange snipped]

I'd look at vmstat and iostat output (vmstat 10, iostat -xd 10) for a few minutes. In vmstat look for high (>100k) ints or cs numbers, in iostat look at io utilization.
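For example, something like this on the database server while one of the episodes is happening (the >100k figure above is only a rough guide):

# sample system activity every 10 seconds during a slowdown
vmstat 10        # watch the "in" (interrupts) and "cs" (context switches) columns
iostat -xd 10    # watch per-device utilisation (%util) and await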