Thread: Re: postgres cpu 100% need help
Dear sir:
Recently a weird problem with my PostgreSQL database has really been bothering me, so I need your help. Here is the problem: in most situations the database works very well, averaging about 3500 tps, the CPU usage of its processes is 3%~10%, and every query responds in less than 20 ms. But sometimes the CPU usage of a process suddenly grows to 90%+, and at that time a simple query can cost 2000+ ms. PS: my PostgreSQL version is 9.3.5 and the database is an OLTP server.
Here is the output when I strace one of the processes while its CPU usage is at 100%:
recvfrom(10, "P\0\0\0J\0 \t\tselect o_count from tsh"..., 8192, 0, NULL, NULL) = 120
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
lseek(16, 0, SEEK_END) = 180543488
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
[...the lseek(16, 0, SEEK_END) = 180543488 call repeats hundreds of times more, broken up only by an occasional select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)...]
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1o_count\0\0\0\0\0\0\0\0"..., 75, 0, NULL, 0) = 75
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
PS: my postgres config:
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold | 50
autovacuum_freeze_max_age | 200000000
autovacuum_max_workers | 3
autovacuum_multixact_freeze_max_age | 400000000
autovacuum_naptime | 1min
autovacuum_vacuum_cost_delay | 20ms
autovacuum_vacuum_cost_limit | -1
autovacuum_vacuum_scale_factor | 0.2
autovacuum_vacuum_threshold | 50
checkpoint_segments | 32
checkpoint_timeout | 5min
effective_cache_size | 64GB
maintenance_work_mem | 500MB
max_connections | 2000
max_files_per_process | 1000
work_mem | 20MB
shared_buffers | 25GB
Are there any problems with this configuration?
qijia.wang@melot.cn
Hello everyone: my PostgreSQL version is 9.3.5 and my database is an OLTP server. Normally CPU usage is 3%~10%, every SQL statement returns in 20 ms, and there are fewer than 5 concurrently active statements. But sometimes the database CPU grows to 90%+ and a simple SELECT takes 2000+ ms.
select count(*) from pg_stat_activity where state<>'idle';
 count
-------
   126
I ran strace -p 86181 (one of the processes while CPU was at 100%):
recvfrom(10, "Q\0\0\0?select * from tshow.p_hpa"..., 8192, 0, NULL, NULL) = 64
lseek(8, 0, SEEK_END) = 32768
lseek(7, 0, SEEK_END) = 1155072
lseek(7, 0, SEEK_END) = 1155072
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
zhiwei.li
On Tue, 27 Oct 2015 11:30:45 +0800 "657985552@qq.com" <657985552@qq.com> wrote:

> Dear sir:
> Recently a weird problem with my PostgreSQL database has really been bothering me [...] My PostgreSQL version is 9.3.5 and the database is an OLTP server.

9.3.5 is pretty old, you should probably schedule an upgrade.

> shared_buffers | 25GB

Try setting this to 16GB. It's been a while since I tested on large-memory/high-load systems, but I seem to remember that shared_buffers above 16GB could cause these sorts of intermittent stalls.

If that doesn't improve the situation, you'll probably need to provide more details, specifically the layout of the table in question, as well as the queries that are active when the problem occurs, and the contents of the pg_locks table when the problem is occurring.

--
Bill Moran
On Tue, Oct 27, 2015 at 12:14 PM, Bill Moran <wmoran@potentialtech.com> wrote:

> [Bill's suggestions quoted above: schedule an upgrade, try shared_buffers = 16GB, and provide the table layout, active queries, and pg_locks contents during the problem.]

possible culprits:

*) io based problems (check iowait, rule this out first)
*) THP compaction (rule this out second)
*) runaway query plan
*) concurrency problems within postgres itself (perf top capture during load is essential)

maybe some other things I'm not thinking of.

merlin
On Mon, Oct 26, 2015 at 8:30 PM, 657985552@qq.com <657985552@qq.com> wrote:
> Dear sir:
> Recently a weird problem with my PostgreSQL database has really been bothering me [...] My PostgreSQL version is 9.3.5 and the database is an OLTP server.
From all the lseeks on the same file, it looks like you are doing a sequential scan on a table for the inside of a nested loop. What is the query it is running, and what is the execution plan for it? To get that, run:
explain (analyze, buffers) select o_count from ....
Cheers,
Jeff
On Tue, Oct 27, 2015 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> [quote of the original problem report and Jeff's request for the query and its execution plan, snipped]

this smells like thp compaction:
On Tue, Oct 27, 2015 at 4:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

> [previous message, ending "this smells like thp compaction:", quoted]

drat -- sorry for partial email. check this link:

http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/

if that turns out not to be productive, need to check 'perf top', and also rule out problems with individual queries.

merlin
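[For reference, whether THP and THP defrag are active can be checked straight from sysfs. A minimal sketch, assuming a reasonably modern Linux host; the redhat_transparent_hugepage path is the RHEL/CentOS 6 variant that comes up later in this thread, and only one of the two paths will exist on a given kernel:]

# Show the THP and THP-defrag settings; the bracketed value (e.g. [always]) is the active one.
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# RHEL/CentOS 6 kernels expose the same knobs under a different path:
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled 2>/dev/null
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag 2>/dev/null

# How much anonymous memory is currently backed by transparent huge pages:
grep AnonHugePages /proc/meminfo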
Thank you for your reply. I will try changing it and wait for the problem to appear again. The SQL is:
select count(t.*) into o_count from tshow.res_room_weight t,tshow.res_room_info r
where t.subcatlg_id=:i_title_id
and t.roomid = r.actorid
and r.levels>=0;
tshow=> \d res_room_info;
Table "tshow.res_room_info"
Column | Type | Modifiers
----------------+-----------------------------+---------------
actorid | integer | not null
nickname | text | not null
livetype | integer |
people_inroom | integer |
poster | character varying(128) |
actor_level | integer |
operatorid | integer |
jointime | timestamp without time zone |
signtime | timestamp without time zone |
levels | integer |
note | text |
leavereason | text |
register_city | integer |
vedio_level | integer |
is_good | integer | default 0
is_display | integer |
live_starttime | timestamp without time zone |
live_endtime | timestamp without time zone |
next_starttime | timestamp without time zone |
max_count | integer | default 40000
is_recommend | integer |
icon | integer |
rich_level | integer |
type | integer |
room_mode | integer |
room_theme | text |
portrait | text |
gender | integer | default 0
tag | text |
live_poster | text |
family_id | integer |
room_lock | integer | default 0
Indexes:
"res_room_info_pkey" PRIMARY KEY, btree (actorid)
"idx_res_room_info_cityid" btree (register_city)
Table "tshow.res_room_info"
Column | Type | Modifiers
----------------+-----------------------------+---------------
actorid | integer | not null
nickname | text | not null
livetype | integer |
people_inroom | integer |
poster | character varying(128) |
actor_level | integer |
operatorid | integer |
jointime | timestamp without time zone |
signtime | timestamp without time zone |
levels | integer |
note | text |
leavereason | text |
register_city | integer |
vedio_level | integer |
is_good | integer | default 0
is_display | integer |
live_starttime | timestamp without time zone |
live_endtime | timestamp without time zone |
next_starttime | timestamp without time zone |
max_count | integer | default 40000
is_recommend | integer |
icon | integer |
rich_level | integer |
type | integer |
room_mode | integer |
room_theme | text |
portrait | text |
gender | integer | default 0
tag | text |
live_poster | text |
family_id | integer |
room_lock | integer | default 0
Indexes:
"res_room_info_pkey" PRIMARY KEY, btree (actorid)
"idx_res_room_info_cityid" btree (register_city)
tshow=> \d tshow.res_room_weight
Table "tshow.res_room_weight"
Column | Type | Modifiers
--------------+-----------------------------+-----------
subcatlg_id | integer | not null
roomid | integer | not null
weight | integer | default 0
is_recommend | integer |
update_time | timestamp without time zone |
product_id | integer | default 1
create_time | timestamp without time zone |
Indexes:
"res_room_weight_pkey" PRIMARY KEY, btree (subcatlg_id, roomid) CLUSTER
Table "tshow.res_room_weight"
Column | Type | Modifiers
--------------+-----------------------------+-----------
subcatlg_id | integer | not null
roomid | integer | not null
weight | integer | default 0
is_recommend | integer |
update_time | timestamp without time zone |
product_id | integer | default 1
create_time | timestamp without time zone |
Indexes:
"res_room_weight_pkey" PRIMARY KEY, btree (subcatlg_id, roomid) CLUSTER
tshow=> select count(*) from tshow.res_room_info ;
count
-------
22648
(1 row)
tshow=> select count(*) from tshow.res_room_weight ;
count
-------
23417
I don't see any exclusive locks in the pg_locks view.
657985552@qq.com
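[For the next incident, a check along these lines shows any ungranted lock requests together with what each waiting backend is running. A minimal sketch using the standard pg_locks and pg_stat_activity views; the -d tshow connection option is an assumption to adjust for your environment:]

# List lock requests that are not granted, with the waiting backend's query.
psql -d tshow -X -c "
SELECT l.pid,
       l.locktype,
       l.mode,
       l.relation::regclass AS relation,
       a.query,
       now() - a.query_start AS waited
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;"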
Thank you for your reply.
1. I'm sure IO is not the problem, but I didn't record the IO status at that time.
2. What is THP compaction? How can I rule this out?
3. The explain output:
tshow=> explain select count(t.*) from tshow.res_room_weight t,tshow.res_room_info r
tshow-> where t.subcatlg_id=46
tshow-> and t.roomid = r.actorid
tshow-> and r.levels>=0;
QUERY PLAN
------------------------------------------------------------------------------------------
Aggregate (cost=24581.50..24581.51 rows=1 width=60)
-> Hash Join (cost=1737.27..24572.14 rows=3744 width=60)
Hash Cond: (r.actorid = t.roomid)
-> Seq Scan on res_room_info r (cost=0.00..22322.19 rows=22364 width=4)
Filter: (levels >= 0)
-> Hash (cost=1689.86..1689.86 rows=3793 width=64)
-> Seq Scan on res_room_weight t (cost=0.00..1689.86 rows=3793 width=64)
Filter: (subcatlg_id = 46)
4. Next I will use perf top to capture a profile during load, as suggested (see the sketch below).
657985552@qq.com
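[A profile during the next spike can be captured with perf. A minimal sketch using standard perf options; the PID, the 99 Hz sample rate, and the 30-second window are placeholders to adjust:]

# Live view of one busy backend (replace the PID with a backend at 100% CPU):
perf top -p 86181

# Or record a system-wide profile with call graphs and inspect it afterwards:
perf record -a -g -F 99 -- sleep 30
perf report --stdio | head -50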
Thank you for your reply.
tshow=> explain (analyze, buffers) select count(t.*) from tshow.res_room_weight t,tshow.res_room_info r
tshow-> where t.subcatlg_id=46
tshow-> and t.roomid = r.actorid
tshow-> and r.levels>=0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=24581.57..24581.58 rows=1 width=60) (actual time=112.613..112.613 rows=1 loops=1)
Buffers: shared hit=23439
-> Hash Join (cost=1737.31..24572.21 rows=3744 width=60) (actual time=16.375..111.879 rows=3730 loops=1)
Hash Cond: (r.actorid = t.roomid)
Buffers: shared hit=23439
-> Seq Scan on res_room_info r (cost=0.00..22322.20 rows=22365 width=4) (actual time=0.031..86.026 rows=22365 loops=1)
Filter: (levels >= 0)
Rows Removed by Filter: 291
Buffers: shared hit=22039
-> Hash (cost=1689.90..1689.90 rows=3793 width=64) (actual time=16.274..16.274 rows=3793 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 329kB
Buffers: shared hit=1397
-> Seq Scan on res_room_weight t (cost=0.00..1689.90 rows=3793 width=64) (actual time=0.348..14.663 rows=3793 loops=1)
Filter: (subcatlg_id = 46)
Rows Removed by Filter: 19641
Buffers: shared hit=1397
Total runtime: 112.754 ms
It cost only 112.754 ms. The log entry from the high-load time is:
2015-10-28 00:00:17.177 CST "SELECT",2015-10-23 19:30:20 CST,276/59546142,0,LOG,00000,"duration: 3300.237 ms execute <unnamed>: SELECT o_count FROM tshow.p_hpart_getcount_intitle($1)","parameters: $1 = '46'",,,,,,,,""
657985552@qq.com
On 10/28/2015 12:12 AM, 657985552@qq.com wrote:
> Thank you for your reply.
> tshow=> explain (analyze, buffers) select count(t.*) from tshow.res_room_weight t,tshow.res_room_info r
> tshow-> where t.subcatlg_id=46
> tshow-> and t.roomid = r.actorid
> tshow-> and r.levels>=0;
http://explain.depesz.com/s/oKN5
shows where the time is spent in that query, but as you say, for that run the timing was fine. There's a lot of time being spent in a seq scan of res_room_info.
I would have written that query as the inner join it is, but that probably won't affect how it's executed.
select ... from tshow.res_room_weight t join tshow.res_room_info r on (t.roomid = r.actorid) where t.subcatlg_id=46 and r.levels>=0;
-- john r pierce, recycling bits in santa cruz
Thanks, I can change the SQL; it needs a small business-logic change in my application. But I don't understand why, when I execute the SQL myself during the high-load period, it costs 200-300 ms, while the log records it costing 2000 ms.
log message:
duration: 2042.493 ms execute <unnamed>: select o_count from tshow.p_hpart_liveingroomlist_count($1)","parameters: $1 = '46'
657985552@qq.com
On Wed, Oct 28, 2015 at 12:12 AM, 657985552@qq.com <657985552@qq.com> wrote:

> [the explain (analyze, buffers) output quoted above, total runtime 112.754 ms, plus the high-load log entry showing "duration: 3300.237 ms execute <unnamed>: SELECT o_count FROM tshow.p_hpart_getcount_intitle($1)"]

You really need it during the high-load condition. Perhaps auto_explain could help.

Cheers,
Jeff
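[One way to act on the auto_explain suggestion, as a sketch: these are the stock auto_explain parameters, the 1s threshold is an arbitrary choice, and log_analyze adds overhead on a busy OLTP box, so treat it as a temporary diagnostic setting. If shared_preload_libraries is already set, merge the value rather than appending a second line.]

# Append auto_explain settings to postgresql.conf and restart (needs a restart window).
cat >> "$PGDATA/postgresql.conf" <<'EOF'
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '1s'   # only log plans of statements slower than 1s
auto_explain.log_analyze = on          # include actual timings (adds overhead)
auto_explain.log_buffers = on          # include buffer hit/read counts
EOF
pg_ctl -D "$PGDATA" restart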
Subject: Re: [GENERAL] ??: postgres cpu 100% need help
From: "莎士比亚说:"
Date:
Hi Moran and others,
Yesterday I hit the PG problem again, and I used perf top; the observations follow:
PerfTop: 11574 irqs/sec kernel: 2.2% exact: 0.0% [4000Hz cycles], (all, 32 CPUs)
81.39% postgres [.] s_lock
5.42% postgres [.] LWLockAcquire
4.59% postgres [.] LWLockRelease
3.06% postgres [.] TransactionIdIsInProgress
0.38% postgres [.] PinBuffer
0.31% postgres [.] TransactionIdPrecedes
0.27% postgres [.] UnpinBuffer
0.19% postgres [.] TransactionIdIsCurrentTransactionId
0.16% postgres [.] heap_hot_search_buffer
0.15% [kernel] [k] number.isra.1
0.14% [kernel] [k] kallsyms_expand_symbol.constprop.1
0.10% [kernel] [k] module_get_kallsym
0.10% libc-2.17.so [.] __strcmp_sse42
0.09% [kernel] [k] _raw_spin_lock
0.09% postgres [.] hash_search_with_hash_value
Is this a spinlock problem? I need everyone's help to solve it. Thanks!
------------------ Original message ------------------
From: "Merlin Moncure" <mmoncure@gmail.com>
Date: Wednesday, October 28, 2015, 2:35 AM
To: "Bill Moran" <wmoran@potentialtech.com>
Cc: "莎士比亚说:" <657985552@qq.com>; "pgsql-general" <pgsql-general@postgresql.org>
Subject: Re: [GENERAL] ??: postgres cpu 100% need help
> On Tue, 27 Oct 2015 11:30:45 +0800
> "657985552@qq.com" <657985552@qq.com> wrote:
>
>> Dear sir:
>> Recently a weird problem with my PostgreSQL database has really been bothering me, so I really need your help. Here is the problem: in most situations the database works very well, averaging 3500 tps, the CPU usage of its processes is 3%~10%, and every query responds in less than 20 ms, but sometimes the CPU usage of a process can suddenly grow to 90%+, and at that time a simple query can cost 2000+ ms. PS: my PostgreSQL version is 9.3.5 and the database is an OLTP server.
>
> 9.3.5 is pretty old, you should probably schedule an upgrade.
>
>> shared_buffers | 25GB
>
> Try setting this to 16GB. It's been a while since I tested on
> large-memory/high-load systems, but I seem to remember that
> shared_buffers above 16G could cause these sorts of intermittent
> stalls.
>
> If that doesn't improve the situation, you'll probably need to
> provide more details, specifically the layout of the table in
> question, as well as the queries that are active when the
> problem occurs, and the contents of the pg_locks table when
> the problem is occurring.
possible culprits:
*) io based problems (check iowait, rule this out first)
*) THP compaction (rule this out second)
*) runaway query plan
*) concurrency problems within postgres itself (perf top capture
during load is essential)
maybe some other things I'm not thinking of.
merlin
2015-11-08 14:22 GMT+01:00 莎士比亚说: <657985552@qq.com>:
> Hi Moran and others; yesterday I hit the PG problem again, and perf top shows s_lock at 81.39%, LWLockAcquire at 5.42%, and LWLockRelease at 4.59% [full output quoted above]. Is this a spinlock problem? I need everyone's help to solve it. Thanks!
Yes, it is a spinlock problem. You can try to limit the number of active connections - http://www.enterprisedb.com/postgres-plus-edb-blog/amit-kapila/read-scalability-postgresql-95. You can also try 9.4, where some of these issues are fixed.
When you hit this issue, the basic prerequisite is a test case. If you can run a simulation that reproduces the issue, then we can identify the bottleneck.
You can try to check lwlocks via systemtap: http://postgres.cz/wiki/Monitorov%C3%A1n%C3%AD_lwlocku_pomoc%C3%AD_systemtapu - the article is in Czech, but the page works with Google Translate.
Regards
Pavel
On Sun, Nov 8, 2015 at 7:22 AM, 莎士比亚说: <657985552@qq.com> wrote:

> [perf top output quoted above, with s_lock at 81.39%]
> is spin lock problem? I need everyone's help to solve the problem. thanks!

Yup, spinlock problem. These can be difficult. What we need now is some metrics. They must be captured during the load event:

*) number of active queries with average duration:
SELECT query, count(*), avg(now() - query_start) FROM pg_stat_activity where state != 'idle' GROUP BY 1;

*) context switches via "vmstat 1"; get a snapshot during load and during non-load time for comparison

Solution to this problem will probably be one or more of:

1) Significantly downgrade shared_buffers (say, to 4GB)
2) Upgrade database to 9.4 and hope for the best
3) Capture backtrace during load event to determine exactly which path is going into spinlock
4) Install pgbouncer or another connection pooler to limit active queries on database
5) Install experimental patches to custom compiled database to test and verify a hypothetical fix

Out of those 5 things, which are possible for you to do? Best case scenario is that you have a non-essential server that reproduces the issue.

merlin
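[Both of the requested metrics can be grabbed in one go during the spike. A minimal sketch: the pg_stat_activity query is the one Merlin gives above, while the -d tshow option and the output paths are assumptions to adjust:]

# Run while the CPU spike is in progress.
OUT=/tmp/load_event_$(date +%Y%m%d_%H%M%S)

# Active queries with average duration (Merlin's query above):
psql -d tshow -X -c "SELECT query, count(*), avg(now() - query_start)
                     FROM pg_stat_activity WHERE state != 'idle'
                     GROUP BY 1 ORDER BY 2 DESC;" > "$OUT.activity"

# Context switches: 30 one-second samples; repeat during a quiet period for comparison.
vmstat 1 30 > "$OUT.vmstat"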
On Tue, Nov 10, 2015 at 8:26 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

> [the metrics request and the list of five possible solutions, quoted above]

Can we also see output of:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag

merlin
Hi Merlin:
I'm sorry to reply so late. I don't have the exact results, but roughly:
> number of active queries with average duration:
> SELECT query, count(*), avg(now() - query_start) FROM
> pg_stat_activity where state != 'idle' GROUP BY 1;
There were on average 100 active connections during the load event; every query cost 2000+ ms.
How can I capture a backtrace during the load event? (See the gdb sketch after this message.)
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
657985552@qq.com
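[For item 3 (capturing a backtrace), one common approach is to attach gdb to a spinning backend during the event. A sketch, assuming gdb and postgres debug symbols are installed, and 86181 stands in for the PID of a backend currently burning CPU:]

# Identify the busiest postgres backends first:
top -b -n 1 | grep postgres | head

PID=86181   # replace with an actual high-CPU backend PID

# Take a few quick stack samples; repeated identical stacks point at the hot lock path.
for i in 1 2 3; do
    gdb -p "$PID" --batch -ex 'bt'
    sleep 1
done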
On Mon, Nov 23, 2015 at 12:20 AM, 657985552@qq.com <657985552@qq.com> wrote:

> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
> [always] madvise never

There's your problem. You need to set those to "never". You can get immediate relief by echoing 'never' to those pseudo-files, but you need to configure your system to set them at OS boot time, too.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 23, 2015 at 9:30 AM, Kevin Grittner <kgrittn@gmail.com> wrote:

> There's your problem. You need to set those to "never". You can
> get immediate relief by echoing 'never' to those pseudo-files, but
> you need to configure your system to set them at OS boot time, too.

yup. IMNSHO automatic enabling of THP defrag is one of the worst performance-related mistakes I've ever seen in the history of the linux kernel (although maybe the blame lies with the distros... I'm not sure). It caused a number of terrible outages here before we figured out the underlying cause. (of course, everyone blamed postgres until then)

merlin
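[Kevin's advice expressed as commands might look like the following sketch. The rc.local approach to making it persistent is one common option (a transparent_hugepage=never kernel boot parameter is another), and the paths assume the CentOS 7 host reported later in the thread:]

# Immediate relief: disable THP and THP defrag on the running system (as root).
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# One way to make it persistent across reboots: repeat the commands from rc.local.
cat >> /etc/rc.d/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF
chmod +x /etc/rc.d/rc.local   # on CentOS 7, rc.local only runs if it is executable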
Hi Kevin:
Thanks for your help, but the following output indicates huge pages are disabled on my OS:
[root@pg1 ~]# grep -i HugePages_Total /proc/meminfo
HugePages_Total: 0
[root@pg1 ~]# cat /proc/sys/vm/nr_hugepages
0
657985552@qq.com
"657985552@qq.com" <657985552@qq.com> writes: > thanks for you help . but the following config Indicate my OS THP is Disable > [root@pg1 ~]# grep -i HugePages_Total /proc/meminfo > HugePages_Total: 0 > [root@pg1 ~]# cat /proc/sys/vm/nr_hugepages > 0 Those values reflect whether *explicit* huge pages are enabled. They do not control *transparent* huge pages. Red Hat's version of all this is documented here: https://access.redhat.com/solutions/46111 regards, tom lane
Oh, thanks, I understand. But I still have a question:
[root@pg1 pgdata]# uname -a
Linux pg1 3.10.0-123.9.3.el7.x86_64 #1 SMP Thu Nov 6 15:06:03 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
[root@pg1 pgdata]# cat /etc/redhat-release
CentOS Linux release 7.0.1406 (Core)
My OS is CentOS 7. Does it have the THP problem?
657985552@qq.com
On Tue, Nov 24, 2015 at 9:57 PM, 657985552@qq.com <657985552@qq.com> wrote:

> oh. thanks, i understand. but i still have a question.
> [root@pg1 pgdata]# uname -a
> Linux pg1 3.10.0-123.9.3.el7.x86_64 #1 SMP Thu Nov 6 15:06:03 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> [root@pg1 pgdata]# cat /etc/redhat-release
> CentOS Linux release 7.0.1406 (Core)
>
> my os is centos7. is there THP problem in it?

Yes. The settings posted above (*/transparent_hugepage/*) are the smoking gun. I've had the exact same problem as you; suddenly the database slows down to zero and just as suddenly goes back to normal.

What is happening here is that the operating system put in some "optimizations" to help systems manage large amounts of memory (typical server memory configurations have gone up by several orders of magnitude since the 4kb page size was chosen). These optimizations do not play well with postgres memory access patterns; the operating system is forced to defragment memory at random intervals, which slows down memory access, causing spinlock problems. Basically postgres and the kernel get into a very high speed argument over memory access.

Lowering shared buffers to around 2GB also provides relief. This suggests that clock sweep is a contributor to the problem; in particular its maintenance of usage_count (the maintenance of which IIRC is changing soon to pure atomic update) would be a place to start sniffing around if we wanted to Really Fix It. So far though, no one has been able to reproduce this issue in a non-production system.

I guess if we were using (non-portable) futexes instead of hand-written spinlocks we'd probably have fewer problems in this area. Nevertheless, given the huge performance risks, I really wonder what RedHat was thinking when they enabled it by default.

merlin