Thread: Re: postgres cpu 100% need help
Dear sir:
Recently a weird problem with my PostgreSQL database has really been bothering me, so I need your help. Here is the problem: in most situations the database works very well, averaging about 3500 tps, the CPU usage of its processes is 3%~10%, and every query responds in less than 20 ms. But sometimes the CPU usage of a process suddenly grows to 90%+, and at that time a simple query can cost 2000+ ms. PS: my PostgreSQL version is 9.3.5 and the database is an OLTP server.
Here is the output when I strace one of the processes while its CPU usage is at 100%:
recvfrom(10, "P\0\0\0J\0 \t\tselect o_count from tsh"..., 8192, 0, NULL, NULL) = 120
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
lseek(16, 0, SEEK_END) = 180543488
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
[...the lseek(16, 0, SEEK_END) = 180543488 call repeats hundreds of times more, broken up only by an occasional select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)...]
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1o_count\0\0\0\0\0\0\0\0"..., 75, 0, NULL, 0) = 75
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
PS: my postgres config:
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold | 50
autovacuum_freeze_max_age | 200000000
autovacuum_max_workers | 3
autovacuum_multixact_freeze_max_age | 400000000
autovacuum_naptime | 1min
autovacuum_vacuum_cost_delay | 20ms
autovacuum_vacuum_cost_limit | -1
autovacuum_vacuum_scale_factor | 0.2
autovacuum_vacuum_threshold | 50
checkpoint_segments | 32
checkpoint_timeout | 5min
effective_cache_size | 64GB
maintenance_work_mem | 500MB
max_connections | 2000
max_files_per_process | 1000
work_mem | 20MB
shared_buffers | 25GB
Are there any problems with this configuration?
qijia.wang@melot.cn
Hello everyone: my PostgreSQL version is 9.3.5 and my database is an OLTP server. Normally CPU usage is 3%~10%, every SQL statement returns in 20 ms, and there are fewer than 5 concurrently active statements. But sometimes the database CPU grows to 90%+ and a simple SELECT takes 2000+ ms.
select count(*) from pg_stat_activity where state<>'idle';
 count
-------
   126
I ran strace -p 86181 (one of the processes while CPU was at 100%):
recvfrom(10, "Q\0\0\0?select * from tshow.p_hpa"..., 8192, 0, NULL, NULL) = 64
lseek(8, 0, SEEK_END) = 32768
lseek(7, 0, SEEK_END) = 1155072
lseek(7, 0, SEEK_END) = 1155072
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
zhiwei.li
On Tue, 27 Oct 2015 11:30:45 +0800 "657985552@qq.com" <657985552@qq.com> wrote:

> Dear sir:
> Recently a weird problem with my PostgreSQL database has really been bothering me [...] My PostgreSQL version is 9.3.5 and the database is an OLTP server.

9.3.5 is pretty old, you should probably schedule an upgrade.

> shared_buffers | 25GB

Try setting this to 16GB. It's been a while since I tested on large-memory/high-load systems, but I seem to remember that shared_buffers above 16GB could cause these sorts of intermittent stalls.

If that doesn't improve the situation, you'll probably need to provide more details, specifically the layout of the table in question, as well as the queries that are active when the problem occurs, and the contents of the pg_locks table when the problem is occurring.

--
Bill Moran
On Tue, Oct 27, 2015 at 12:14 PM, Bill Moran <wmoran@potentialtech.com> wrote:

> [Bill's suggestions quoted above: schedule an upgrade, try shared_buffers = 16GB, and provide the table layout, active queries, and pg_locks contents during the problem.]

possible culprits:

*) io based problems (check iowait, rule this out first)
*) THP compaction (rule this out second)
*) runaway query plan
*) concurrency problems within postgres itself (perf top capture during load is essential)

maybe some other things I'm not thinking of.

merlin
On Mon, Oct 26, 2015 at 8:30 PM, 657985552@qq.com <657985552@qq.com> wrote:
> Dear sir:
> Recently a weird problem with my PostgreSQL database has really been bothering me [...] My PostgreSQL version is 9.3.5 and the database is an OLTP server.
From all the lseeks on the same file, it looks like you are doing a sequential scan on a table for the inside of a nested loop. What is the query it is running, and what is the execution plan for it? To get that, run:
explain (analyze, buffers) select o_count from ....
Cheers,
Jeff
On Tue, Oct 27, 2015 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> [quote of the original problem report and Jeff's request for the query and its execution plan, snipped]

this smells like thp compaction:
On Tue, Oct 27, 2015 at 4:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

> [previous message, ending "this smells like thp compaction:", quoted]

drat -- sorry for partial email. check this link:

http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/

if that turns out not to be productive, need to check 'perf top', and also rule out problems with individual queries.

merlin
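[For reference, whether THP and THP defrag are active can be checked straight from sysfs. A minimal sketch, assuming a reasonably modern Linux host; the redhat_transparent_hugepage path is the RHEL/CentOS 6 variant that comes up later in this thread, and only one of the two paths will exist on a given kernel:]

# Show the THP and THP-defrag settings; the bracketed value (e.g. [always]) is the active one.
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# RHEL/CentOS 6 kernels expose the same knobs under a different path:
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled 2>/dev/null
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag 2>/dev/null

# How much anonymous memory is currently backed by transparent huge pages:
grep AnonHugePages /proc/meminfo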
Thank you for your reply. I will try changing it and wait for the problem to appear again. The SQL is:
select count(t.*) into o_count from tshow.res_room_weight t,tshow.res_room_info r
where t.subcatlg_id=:i_title_id
and t.roomid = r.actorid
and r.levels>=0;
tshow=> \d res_room_info;
Table "tshow.res_room_info"
Column | Type | Modifiers
----------------+-----------------------------+---------------
actorid | integer | not null
nickname | text | not null
livetype | integer |
people_inroom | integer |
poster | character varying(128) |
actor_level | integer |
operatorid | integer |
jointime | timestamp without time zone |
signtime | timestamp without time zone |
levels | integer |
note | text |
leavereason | text |
register_city | integer |
vedio_level | integer |
is_good | integer | default 0
is_display | integer |
live_starttime | timestamp without time zone |
live_endtime | timestamp without time zone |
next_starttime | timestamp without time zone |
max_count | integer | default 40000
is_recommend | integer |
icon | integer |
rich_level | integer |
type | integer |
room_mode | integer |
room_theme | text |
portrait | text |
gender | integer | default 0
tag | text |
live_poster | text |
family_id | integer |
room_lock | integer | default 0
Indexes:
"res_room_info_pkey" PRIMARY KEY, btree (actorid)
"idx_res_room_info_cityid" btree (register_city)
Table "tshow.res_room_info"
Column | Type | Modifiers
----------------+-----------------------------+---------------
actorid | integer | not null
nickname | text | not null
livetype | integer |
people_inroom | integer |
poster | character varying(128) |
actor_level | integer |
operatorid | integer |
jointime | timestamp without time zone |
signtime | timestamp without time zone |
levels | integer |
note | text |
leavereason | text |
register_city | integer |
vedio_level | integer |
is_good | integer | default 0
is_display | integer |
live_starttime | timestamp without time zone |
live_endtime | timestamp without time zone |
next_starttime | timestamp without time zone |
max_count | integer | default 40000
is_recommend | integer |
icon | integer |
rich_level | integer |
type | integer |
room_mode | integer |
room_theme | text |
portrait | text |
gender | integer | default 0
tag | text |
live_poster | text |
family_id | integer |
room_lock | integer | default 0
Indexes:
"res_room_info_pkey" PRIMARY KEY, btree (actorid)
"idx_res_room_info_cityid" btree (register_city)
tshow=> \d tshow.res_room_weight
Table "tshow.res_room_weight"
Column | Type | Modifiers
--------------+-----------------------------+-----------
subcatlg_id | integer | not null
roomid | integer | not null
weight | integer | default 0
is_recommend | integer |
update_time | timestamp without time zone |
product_id | integer | default 1
create_time | timestamp without time zone |
Indexes:
"res_room_weight_pkey" PRIMARY KEY, btree (subcatlg_id, roomid) CLUSTER
Table "tshow.res_room_weight"
Column | Type | Modifiers
--------------+-----------------------------+-----------
subcatlg_id | integer | not null
roomid | integer | not null
weight | integer | default 0
is_recommend | integer |
update_time | timestamp without time zone |
product_id | integer | default 1
create_time | timestamp without time zone |
Indexes:
"res_room_weight_pkey" PRIMARY KEY, btree (subcatlg_id, roomid) CLUSTER
tshow=> select count(*) from tshow.res_room_info ;
count
-------
22648
(1 row)
tshow=> select count(*) from tshow.res_room_weight ;
count
-------
23417
I don't see any exclusive locks in the pg_locks view.
657985552@qq.com
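[For the next incident, a check along these lines shows any ungranted lock requests together with what each waiting backend is running. A minimal sketch using the standard pg_locks and pg_stat_activity views; the -d tshow connection option is an assumption to adjust for your environment:]

# List lock requests that are not granted, with the waiting backend's query.
psql -d tshow -X -c "
SELECT l.pid,
       l.locktype,
       l.mode,
       l.relation::regclass AS relation,
       a.query,
       now() - a.query_start AS waited
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;"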
Thank you for your reply.
1. I'm sure IO is not the problem, but I didn't record the IO status at that time.
2. What is THP compaction? How can I rule this out?
3. The explain output:
tshow=> explain select count(t.*) from tshow.res_room_weight t,tshow.res_room_info r
tshow-> where t.subcatlg_id=46
tshow-> and t.roomid = r.actorid
tshow-> and r.levels>=0;
QUERY PLAN
------------------------------------------------------------------------------------------
Aggregate (cost=24581.50..24581.51 rows=1 width=60)
-> Hash Join (cost=1737.27..24572.14 rows=3744 width=60)
Hash Cond: (r.actorid = t.roomid)
-> Seq Scan on res_room_info r (cost=0.00..22322.19 rows=22364 width=4)
Filter: (levels >= 0)
-> Hash (cost=1689.86..1689.86 rows=3793 width=64)
-> Seq Scan on res_room_weight t (cost=0.00..1689.86 rows=3793 width=64)
Filter: (subcatlg_id = 46)
4. Next I will use perf top to capture a profile during load, as suggested (see the sketch below).
657985552@qq.com
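[A profile during the next spike can be captured with perf. A minimal sketch using standard perf options; the PID, the 99 Hz sample rate, and the 30-second window are placeholders to adjust:]

# Live view of one busy backend (replace the PID with a backend at 100% CPU):
perf top -p 86181

# Or record a system-wide profile with call graphs and inspect it afterwards:
perf record -a -g -F 99 -- sleep 30
perf report --stdio | head -50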
Thank you for your reply.
tshow=> explain (analyze, buffers) select count(t.*) from tshow.res_room_weight t,tshow.res_room_info r
tshow-> where t.subcatlg_id=46
tshow-> and t.roomid = r.actorid
tshow-> and r.levels>=0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=24581.57..24581.58 rows=1 width=60) (actual time=112.613..112.613 rows=1 loops=1)
Buffers: shared hit=23439
-> Hash Join (cost=1737.31..24572.21 rows=3744 width=60) (actual time=16.375..111.879 rows=3730 loops=1)
Hash Cond: (r.actorid = t.roomid)
Buffers: shared hit=23439
-> Seq Scan on res_room_info r (cost=0.00..22322.20 rows=22365 width=4) (actual time=0.031..86.026 rows=22365 loops=1)
Filter: (levels >= 0)
Rows Removed by Filter: 291
Buffers: shared hit=22039
-> Hash (cost=1689.90..1689.90 rows=3793 width=64) (actual time=16.274..16.274 rows=3793 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 329kB
Buffers: shared hit=1397
-> Seq Scan on res_room_weight t (cost=0.00..1689.90 rows=3793 width=64) (actual time=0.348..14.663 rows=3793 loops=1)
Filter: (subcatlg_id = 46)
Rows Removed by Filter: 19641
Buffers: shared hit=1397
Total runtime: 112.754 ms
It cost only 112.754 ms. The log entry from the high-load time is:
2015-10-28 00:00:17.177 CST "SELECT",2015-10-23 19:30:20 CST,276/59546142,0,LOG,00000,"duration: 3300.237 ms execute <unnamed>: SELECT o_count FROM tshow.p_hpart_getcount_intitle($1)","parameters: $1 = '46'",,,,,,,,""
657985552@qq.com
On 10/28/2015 12:12 AM, 657985552@qq.com wrote:
> Thank you for your reply.
> tshow=> explain (analyze, buffers) select count(t.*) from tshow.res_room_weight t,tshow.res_room_info r
> tshow-> where t.subcatlg_id=46
> tshow-> and t.roomid = r.actorid
> tshow-> and r.levels>=0;
http://explain.depesz.com/s/oKN5
shows where the time is spent in that query, but as you say, for that run the timing was fine. There's a lot of time being spent in a seq scan of res_room_info.
I would have written that query as the inner join it is, but that probably won't affect how it's executed.
select ... from tshow.res_room_weight t join tshow.res_room_info r on (t.roomid = r.actorid) where t.subcatlg_id=46 and r.levels>=0;
-- john r pierce, recycling bits in santa cruz
Thanks, I can change the SQL; it needs a small business-logic change in my application. But I don't understand why, when I execute the SQL myself during the high-load period, it costs 200-300 ms, while the log records it costing 2000 ms.
log message:
duration: 2042.493 ms execute <unnamed>: select o_count from tshow.p_hpart_liveingroomlist_count($1)","parameters: $1 = '46'
657985552@qq.com
On Wed, Oct 28, 2015 at 12:12 AM, 657985552@qq.com <657985552@qq.com> wrote:

> [the explain (analyze, buffers) output quoted above, total runtime 112.754 ms, plus the high-load log entry showing "duration: 3300.237 ms execute <unnamed>: SELECT o_count FROM tshow.p_hpart_getcount_intitle($1)"]

You really need it during the high-load condition. Perhaps auto_explain could help.

Cheers,
Jeff
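[One way to act on the auto_explain suggestion, as a sketch: these are the stock auto_explain parameters, the 1s threshold is an arbitrary choice, and log_analyze adds overhead on a busy OLTP box, so treat it as a temporary diagnostic setting. If shared_preload_libraries is already set, merge the value rather than appending a second line.]

# Append auto_explain settings to postgresql.conf and restart (needs a restart window).
cat >> "$PGDATA/postgresql.conf" <<'EOF'
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '1s'   # only log plans of statements slower than 1s
auto_explain.log_analyze = on          # include actual timings (adds overhead)
auto_explain.log_buffers = on          # include buffer hit/read counts
EOF
pg_ctl -D "$PGDATA" restart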
Subject: Re: [GENERAL] ??: postgres cpu 100% need help
From: "莎士比亚说:"
Date:
Hi Moran and others,
Yesterday I hit the PG problem again, and I used perf top; the observations follow:
PerfTop: 11574 irqs/sec kernel: 2.2% exact: 0.0% [4000Hz cycles], (all, 32 CPUs)
81.39% postgres [.] s_lock
5.42% postgres [.] LWLockAcquire
4.59% postgres [.] LWLockRelease
3.06% postgres [.] TransactionIdIsInProgress
0.38% postgres [.] PinBuffer
0.31% postgres [.] TransactionIdPrecedes
0.27% postgres [.] UnpinBuffer
0.19% postgres [.] TransactionIdIsCurrentTransactionId
0.16% postgres [.] heap_hot_search_buffer
0.15% [kernel] [k] number.isra.1
0.14% [kernel] [k] kallsyms_expand_symbol.constprop.1
0.10% [kernel] [k] module_get_kallsym
0.10% libc-2.17.so [.] __strcmp_sse42
0.09% [kernel] [k] _raw_spin_lock
0.09% postgres [.] hash_search_with_hash_value
Is this a spinlock problem? I need everyone's help to solve it. Thanks!
------------------ Original message ------------------
From: "Merlin Moncure" <mmoncure@gmail.com>
Date: Wednesday, October 28, 2015, 2:35 AM
To: "Bill Moran" <wmoran@potentialtech.com>
Cc: "莎士比亚说:" <657985552@qq.com>; "pgsql-general" <pgsql-general@postgresql.org>
Subject: Re: [GENERAL] ??: postgres cpu 100% need help
> On Tue, 27 Oct 2015 11:30:45 +0800
> "657985552@qq.com" <657985552@qq.com> wrote:
>
>> Dear sir:
>> Recently a weird problem with my PostgreSQL database has really been bothering me, so I really need your help. Here is the problem: in most situations the database works very well, averaging 3500 tps, the CPU usage of its processes is 3%~10%, and every query responds in less than 20 ms, but sometimes the CPU usage of a process can suddenly grow to 90%+, and at that time a simple query can cost 2000+ ms. PS: my PostgreSQL version is 9.3.5 and the database is an OLTP server.
>
> 9.3.5 is pretty old, you should probably schedule an upgrade.
>
>> shared_buffers | 25GB
>
> Try setting this to 16GB. It's been a while since I tested on
> large-memory/high-load systems, but I seem to remember that
> shared_buffers above 16G could cause these sorts of intermittent
> stalls.
>
> If that doesn't improve the situation, you'll probably need to
> provide more details, specifically the layout of the table in
> question, as well as the queries that are active when the
> problem occurs, and the contents of the pg_locks table when
> the problem is occurring.
possible culprits:
*) io based problems (check iowait, rule this out first)
*) THP compaction (rule this out second)
*) runaway query plan
*) concurrency problems within postgres itself (perf top capture
during load is essential)
maybe some other things I'm not thinking of.
merlin
2015-11-08 14:22 GMT+01:00 莎士比亚说: <657985552@qq.com>:
> Hi Moran and others; yesterday I hit the PG problem again, and perf top shows s_lock at 81.39%, LWLockAcquire at 5.42%, and LWLockRelease at 4.59% [full output quoted above]. Is this a spinlock problem? I need everyone's help to solve it. Thanks!
Yes, it is a spinlock problem. You can try to limit the number of active connections - http://www.enterprisedb.com/postgres-plus-edb-blog/amit-kapila/read-scalability-postgresql-95. You can also try 9.4, where some of these issues are fixed.
When you hit this issue, the basic prerequisite is a test case. If you can run a simulation that reproduces the issue, then we can identify the bottleneck.
You can try to check lwlocks via systemtap: http://postgres.cz/wiki/Monitorov%C3%A1n%C3%AD_lwlocku_pomoc%C3%AD_systemtapu - the article is in Czech, but the page works with Google Translate.
Regards
Pavel
On Sun, Nov 8, 2015 at 7:22 AM, 莎士比亚说: <657985552@qq.com> wrote:

> [perf top output quoted above, with s_lock at 81.39%]
> is spin lock problem? I need everyone's help to solve the problem. thanks!

Yup, spinlock problem. These can be difficult. What we need now is some metrics. They must be captured during the load event:

*) number of active queries with average duration:
SELECT query, count(*), avg(now() - query_start) FROM pg_stat_activity where state != 'idle' GROUP BY 1;

*) context switches via "vmstat 1"; get a snapshot during load and during non-load time for comparison

Solution to this problem will probably be one or more of:

1) Significantly downgrade shared_buffers (say, to 4GB)
2) Upgrade database to 9.4 and hope for the best
3) Capture backtrace during load event to determine exactly which path is going into spinlock
4) Install pgbouncer or another connection pooler to limit active queries on database
5) Install experimental patches to custom compiled database to test and verify a hypothetical fix

Out of those 5 things, which are possible for you to do? Best case scenario is that you have a non-essential server that reproduces the issue.

merlin
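[Both of the requested metrics can be grabbed in one go during the spike. A minimal sketch: the pg_stat_activity query is the one Merlin gives above, while the -d tshow option and the output paths are assumptions to adjust:]

# Run while the CPU spike is in progress.
OUT=/tmp/load_event_$(date +%Y%m%d_%H%M%S)

# Active queries with average duration (Merlin's query above):
psql -d tshow -X -c "SELECT query, count(*), avg(now() - query_start)
                     FROM pg_stat_activity WHERE state != 'idle'
                     GROUP BY 1 ORDER BY 2 DESC;" > "$OUT.activity"

# Context switches: 30 one-second samples; repeat during a quiet period for comparison.
vmstat 1 30 > "$OUT.vmstat"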
On Tue, Nov 10, 2015 at 8:26 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

> [the metrics request and the list of five possible solutions, quoted above]

Can we also see output of:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag

merlin
Hi Merlin:
I'm sorry to reply so late. I don't have the exact results, but roughly:
> number of active queries with average duration:
> SELECT query, count(*), avg(now() - query_start) FROM
> pg_stat_activity where state != 'idle' GROUP BY 1;
There were on average 100 active connections during the load event; every query cost 2000+ ms.
How can I capture a backtrace during the load event? (See the gdb sketch after this message.)
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
657985552@qq.com
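[For item 3 (capturing a backtrace), one common approach is to attach gdb to a spinning backend during the event. A sketch, assuming gdb and postgres debug symbols are installed, and 86181 stands in for the PID of a backend currently burning CPU:]

# Identify the busiest postgres backends first:
top -b -n 1 | grep postgres | head

PID=86181   # replace with an actual high-CPU backend PID

# Take a few quick stack samples; repeated identical stacks point at the hot lock path.
for i in 1 2 3; do
    gdb -p "$PID" --batch -ex 'bt'
    sleep 1
done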
On Mon, Nov 23, 2015 at 12:20 AM, 657985552@qq.com <657985552@qq.com> wrote:

> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
> [always] madvise never

There's your problem. You need to set those to "never". You can get immediate relief by echoing 'never' to those pseudo-files, but you need to configure your system to set them at OS boot time, too.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 23, 2015 at 9:30 AM, Kevin Grittner <kgrittn@gmail.com> wrote:

> There's your problem. You need to set those to "never". You can
> get immediate relief by echoing 'never' to those pseudo-files, but
> you need to configure your system to set them at OS boot time, too.

yup. IMNSHO automatic enabling of THP defrag is one of the worst performance-related mistakes I've ever seen in the history of the linux kernel (although maybe the blame lies with the distros... I'm not sure). It caused a number of terrible outages here before we figured out the underlying cause. (of course, everyone blamed postgres until then)

merlin
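[Kevin's advice expressed as commands might look like the following sketch. The rc.local approach to making it persistent is one common option (a transparent_hugepage=never kernel boot parameter is another), and the paths assume the CentOS 7 host reported later in the thread:]

# Immediate relief: disable THP and THP defrag on the running system (as root).
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# One way to make it persistent across reboots: repeat the commands from rc.local.
cat >> /etc/rc.d/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF
chmod +x /etc/rc.d/rc.local   # on CentOS 7, rc.local only runs if it is executable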
Hi Kevin:
Thanks for your help, but the following output indicates huge pages are disabled on my OS:
[root@pg1 ~]# grep -i HugePages_Total /proc/meminfo
HugePages_Total: 0
[root@pg1 ~]# cat /proc/sys/vm/nr_hugepages
0
657985552@qq.com
"657985552@qq.com" <657985552@qq.com> writes: > thanks for you help . but the following config Indicate my OS THP is Disable > [root@pg1 ~]# grep -i HugePages_Total /proc/meminfo > HugePages_Total: 0 > [root@pg1 ~]# cat /proc/sys/vm/nr_hugepages > 0 Those values reflect whether *explicit* huge pages are enabled. They do not control *transparent* huge pages. Red Hat's version of all this is documented here: https://access.redhat.com/solutions/46111 regards, tom lane
Oh, thanks, I understand. But I still have a question:
[root@pg1 pgdata]# uname -a
Linux pg1 3.10.0-123.9.3.el7.x86_64 #1 SMP Thu Nov 6 15:06:03 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
[root@pg1 pgdata]# cat /etc/redhat-release
CentOS Linux release 7.0.1406 (Core)
My OS is CentOS 7. Does it have the THP problem?
657985552@qq.com
On Tue, Nov 24, 2015 at 9:57 PM, 657985552@qq.com <657985552@qq.com> wrote:

> oh. thanks, i understand. but i still have a question.
> [root@pg1 pgdata]# uname -a
> Linux pg1 3.10.0-123.9.3.el7.x86_64 #1 SMP Thu Nov 6 15:06:03 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> [root@pg1 pgdata]# cat /etc/redhat-release
> CentOS Linux release 7.0.1406 (Core)
>
> my os is centos7. is there THP problem in it?

Yes. The settings posted above (*/transparent_hugepage/*) are the smoking gun. I've had the exact same problem as you; suddenly the database slows down to zero and just as suddenly goes back to normal.

What is happening here is that the operating system put in some "optimizations" to help systems manage large amounts of memory (typical server memory configurations have gone up by several orders of magnitude since the 4kb page size was chosen). These optimizations do not play well with postgres memory access patterns; the operating system is forced to defragment memory at random intervals, which slows down memory access, causing spinlock problems. Basically postgres and the kernel get into a very high speed argument over memory access.

Lowering shared buffers to around 2GB also provides relief. This suggests that clock sweep is a contributor to the problem; in particular its maintenance of usage_count (the maintenance of which IIRC is changing soon to pure atomic update) would be a place to start sniffing around if we wanted to Really Fix It. So far though, no one has been able to reproduce this issue in a non-production system.

I guess if we were using (non-portable) futexes instead of hand-written spinlocks we'd probably have fewer problems in this area. Nevertheless, given the huge performance risks, I really wonder what RedHat was thinking when they enabled it by default.

merlin