Thread: EXPLAIN SELECT .. does not return
Hi, This has become a major problem for us. Thank you in advance for your help. OS: SUSE Linux 2.6.5-7.191-bigsmp PostgreSQL: 7.4.8 Application: ModPerl Web application using DBI.pm Database size: 100 Gb, 1025 Tables. Problem: EXPLAIN SELECT ... does not return. Description: The Application uses an EXPLAIN cost to determine whether a client's dynamic request for data is too demanding for the server so it can gracefully deny them. (Currently, anything over cost=0.00..500000.00). The system gets about 3000 page requests a day. Certain SQL Queries, I believe those with many table joins, when run as EXPLAIN plans, never return. As a result they seem to stay churning in the system. Once that happens other queries build up and the performance of the whole database server grinds to a halt. Postgresql never dies, but eventually, user requests start timing out. This happens on average two or three times a week. I kill an offending process and all's well again. I have not been able to identify with certainty an offending SQL statement. Config params, that have changed from default: tcpip_socket = true max_connections = 200 shared_buffers = 2000 sort_mem = 1048576 vacuum_mem = 65536 max_fsm_pages = 100000 max_fsm_relations = 1000 max_files_per_process = 1000 fsync = false wal_sync_method = fsync wal_buffers = 800 checkpoint_segments = 30 commit_delay = 100 commit_siblings = 50 effective_cache_size = 1000 random_page_cost = 4 geqo = true geqo_threshold = 14 default_statistics_target = 100 from_collapse_limit = 13 join_collapse_limit = 13 Note: we load lumps of data ea. week. Then primarily it is a readonly database.
On 12/5/05, David Link <dlink@soundscan.com> wrote: > Hi, This has become a major problem for us. Thank you in advance for > your help. > > OS: SUSE Linux 2.6.5-7.191-bigsmp > PostgreSQL: 7.4.8 > Application: ModPerl Web application using DBI.pm > Database size: 100 Gb, 1025 Tables. > > Problem: EXPLAIN SELECT ... does not return. > > Description: > > The Application uses an EXPLAIN cost to determine whether a client's dynamic > request for data is too demanding for the server so it can gracefully deny > them. (Currently, anything over cost=0.00..500000.00). > > The system gets about 3000 page requests a day. > > Certain SQL Queries, I believe those with many table joins, when run as > EXPLAIN plans, never return. As a result they seem to stay churning in the > system. Once that happens other queries build up and the performance of the > whole database server grinds to a halt. Postgresql never dies, but > eventually, user requests start timing out. > > This happens on average two or three times a week. I kill an offending > process and all's well again. I have not been able to identify with > certainty > an offending SQL statement. > > Config params, that have changed from default: > > tcpip_socket = true > max_connections = 200 > shared_buffers = 2000 > sort_mem = 1048576 > vacuum_mem = 65536 > max_fsm_pages = 100000 > max_fsm_relations = 1000 > max_files_per_process = 1000 > fsync = false > wal_sync_method = fsync > wal_buffers = 800 > checkpoint_segments = 30 > commit_delay = 100 > commit_siblings = 50 > effective_cache_size = 1000 > random_page_cost = 4 > geqo = true > geqo_threshold = 14 > default_statistics_target = 100 > from_collapse_limit = 13 > join_collapse_limit = 13 > > Note: we load lumps of data ea. week. Then primarily it is a readonly > database. > > when you have thoses cases, you can take a look in pg_stats_activity to find the offending query... or simply logs all queries -- Atentamente, Jaime Casanova (DBA: DataBase Aniquilator ;)
David Link <dlink@soundscan.com> writes: > Certain SQL Queries, I believe those with many table joins, when run as > EXPLAIN plans, never return. I'd guess that one or all of these settings are excessive: > geqo_threshold = 14 > from_collapse_limit = 13 > join_collapse_limit = 13 Keep in mind that the planning cost is exponential in these limits, eg geqo_threshold = 14 probably allows planning times about 14 times greater than geqo_threshold = 13. While I'm looking: > shared_buffers = 2000 That seems extremely low for modern machines. > sort_mem = 1048576 That, on the other hand, is almost certainly way too high for a system-wide setting. You're promising you have 1Gb available for *each* sort. > max_fsm_pages = 100000 And this way too low for a 100Gb database, unless most of the tables never see any UPDATEs or DELETEs. > wal_buffers = 800 Seems a bit high, especially considering you have fsync disabled and thus there is no benefit whatever to buffering WAL. > commit_delay = 100 > commit_siblings = 50 Have you measured any benefit to having this turned on? All in all it looks like your configuration settings were chosen by throwing darts :-( regards, tom lane
Jaime Casanova wrote: >when you have thoses cases, you can take a look in pg_stats_activity >to find the offending query... > >or simply logs all queries > > Thanks for the advice. I also turned on stat_command_string
Tom Lane wrote: >David Link <dlink@soundscan.com> writes: > > >>Certain SQL Queries, I believe those with many table joins, when run as >>EXPLAIN plans, never return. >> >> > >I'd guess that one or all of these settings are excessive: > > > >>geqo_threshold = 14 >>from_collapse_limit = 13 >>join_collapse_limit = 13 >> >> > >Keep in mind that the planning cost is exponential in these limits, >eg geqo_threshold = 14 probably allows planning times about 14 times >greater than geqo_threshold = 13. > >While I'm looking: > > > >>shared_buffers = 2000 >> >> > >That seems extremely low for modern machines. > > > >>sort_mem = 1048576 >> >> > >That, on the other hand, is almost certainly way too high for a system-wide >setting. You're promising you have 1Gb available for *each* sort. > > > >>max_fsm_pages = 100000 >> >> > >And this way too low for a 100Gb database, unless most of the tables >never see any UPDATEs or DELETEs. > > > >>wal_buffers = 800 >> >> > >Seems a bit high, especially considering you have fsync disabled and >thus there is no benefit whatever to buffering WAL. > > > >>commit_delay = 100 >>commit_siblings = 50 >> >> > >Have you measured any benefit to having this turned on? > >All in all it looks like your configuration settings were chosen by >throwing darts :-( > > regards, tom lane > > > Thanks for your reply, Tom. Different folks have made different suggestions. Can you suggest more reasonable values for these? But more importantly, do you think the problem I am having is due to these configuration short comings? Thanks much. David
>> >> > Thanks for your reply, Tom. Different folks have made different > suggestions. Can you suggest more reasonable values for these? But > more importantly, do you think the problem I am having is due to these > configuration short comings? > > Thanks much. David take a look at: http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html Joshua D. Drake > David > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster
David Link <dlink@soundscan.com> writes: > more importantly, do you think the problem I am having is due to these > configuration short comings? Yeah, the planning thresholds ... regards, tom lane