Thread: Queries becoming slow under heavy load
Hi,
We are running some performance tests. With a lot of concurrent access, queries get very slow; when there is no load, the same queries run fast.
We see a trend in these queries: the ones that become very slow seem to have an ORDER BY or MAX in them.
Here are our config settings:
name | setting | description
---------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------
add_missing_from | off | Automatically adds missing table references to FROM clauses.
allow_system_table_mods | off | Allows modifications of the structure of system tables.
archive_command | (disabled) | Sets the shell command that will be called to archive a WAL file.
archive_mode | off | Allows archiving of WAL files using archive_command.
archive_timeout | 0 | Forces a switch to the next xlog file if a new file has not been started within N seconds.
array_nulls | on | Enable input of NULL elements in arrays.
authentication_timeout | 1min | Sets the maximum allowed time to complete client authentication.
autovacuum | on | Starts the autovacuum subprocess.
autovacuum_analyze_scale_factor | 0.1 | Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples.
autovacuum_analyze_threshold | 250 | Minimum number of tuple inserts, updates or deletes prior to analyze.
autovacuum_freeze_max_age | 200000000 | Age at which to autovacuum a table to prevent transaction ID wraparound.
autovacuum_max_workers | 3 | Sets the maximum number of simultaneously running autovacuum worker processes.
autovacuum_naptime | 5min | Time to sleep between autovacuum runs.
autovacuum_vacuum_cost_delay | 20ms | Vacuum cost delay in milliseconds, for autovacuum.
autovacuum_vacuum_cost_limit | -1 | Vacuum cost amount available before napping, for autovacuum.
autovacuum_vacuum_scale_factor | 0.2 | Number of tuple updates or deletes prior to vacuum as a fraction of reltuples.
autovacuum_vacuum_threshold | 500 | Minimum number of tuple updates or deletes prior to vacuum.
backslash_quote | safe_encoding | Sets whether "\'" is allowed in string literals.
bgwriter_delay | 200ms | Background writer sleep time between rounds.
bgwriter_lru_maxpages | 100 | Background writer maximum number of LRU pages to flush per round.
bgwriter_lru_multiplier | 2 | Background writer multiplier on average buffers to scan per round.
block_size | 8192 | Shows the size of a disk block.
bonjour_name | | Sets the Bonjour broadcast service name.
check_function_bodies | on | Check function bodies during CREATE FUNCTION.
checkpoint_completion_target | 0.5 | Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval.
checkpoint_segments | 3 | Sets the maximum distance in log segments between automatic WAL checkpoints.
checkpoint_timeout | 5min | Sets the maximum time between automatic WAL checkpoints.
checkpoint_warning | 30s | Enables warnings if checkpoint segments are filled more frequently than this.
client_encoding | UTF8 | Sets the client's character set encoding.
client_min_messages | notice | Sets the message levels that are sent to the client.
commit_delay | 250 | Sets the delay in microseconds between transaction commit and flushing WAL to disk.
commit_siblings | 10 | Sets the minimum concurrent open transactions before performing commit_delay.
constraint_exclusion | off | Enables the planner to use constraints to optimize queries.
cpu_index_tuple_cost | 0.005 | Sets the planner's estimate of the cost of processing each index entry during an index scan.
cpu_operator_cost | 0.0025 | Sets the planner's estimate of the cost of processing each operator or function call.
cpu_tuple_cost | 0.01 | Sets the planner's estimate of the cost of processing each tuple (row).
custom_variable_classes | | Sets the list of known custom variable classes.
DateStyle | ISO, MDY | Sets the display format for date and time values.
db_user_namespace | off | Enables per-database user names.
deadlock_timeout | 1s | Sets the time to wait on a lock before checking for deadlock.
debug_assertions | off | Turns on various assertion checks.
debug_pretty_print | off | Indents parse and plan tree displays.
debug_print_parse | off | Prints the parse tree to the server log.
debug_print_plan | off | Prints the execution plan to server log.
debug_print_rewritten | off | Prints the parse tree after rewriting to server log.
default_statistics_target | 10 | Sets the default statistics target.
default_tablespace | | Sets the default tablespace to create tables and indexes in.
default_text_search_config | pg_catalog.simple | Sets default text search configuration.
and the box info:
> cat /proc/meminfo
MemTotal: 8177116 kB
MemFree: 2830212 kB
Buffers: 83212 kB
Cached: 2385740 kB
SwapCached: 32 kB
Active: 4037560 kB
Inactive: 1082912 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 8177116 kB
LowFree: 2830212 kB
SwapTotal: 2097112 kB
SwapFree: 2096612 kB
Dirty: 4548 kB
Writeback: 72 kB
AnonPages: 2651288 kB
Mapped: 311824 kB
Slab: 173968 kB
PageTables: 20512 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 6185668 kB
Committed_AS: 3602784 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 263672 kB
VmallocChunk: 34359474295 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
> cat /proc/meminfo
MemTotal: 8177116 kB
MemFree: 2830212 kB
Buffers: 83212 kB
Cached: 2385740 kB
SwapCached: 32 kB
Active: 4037560 kB
Inactive: 1082912 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 8177116 kB
LowFree: 2830212 kB
SwapTotal: 2097112 kB
SwapFree: 2096612 kB
Dirty: 4548 kB
Writeback: 72 kB
AnonPages: 2651288 kB
Mapped: 311824 kB
Slab: 173968 kB
PageTables: 20512 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 6185668 kB
Committed_AS: 3602784 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 263672 kB
VmallocChunk: 34359474295 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
It seems to me that we should try increasing shared_buffers. But do you have any other suggestions? Or do you see anything wrong in our config?
Thanks,
Anne
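
P.S. (For anyone who wants the same view of their own server: a query along these lines produces a table like the one above, where the description column is just PostgreSQL's own one-line short_desc.)

-- dump the current configuration with descriptions
SELECT name, setting, short_desc
FROM pg_settings
ORDER BY name;

Individual parameters can also be checked with e.g. SHOW work_mem;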
On 1/25/2011 3:37 PM, Anne Rosset wrote:
> Hi,
>
> We are running some performance tests. With a lot of concurrent access,
> queries get very slow; when there is no load, the same queries run fast.
>
> We see a trend in these queries: the ones that become very slow seem to
> have an ORDER BY or MAX in them.
>
> Here are our config settings:
> <SNIP>
> It seems to me that we should try increasing shared_buffers. But do you
> have any other suggestions? Or do you see anything wrong in our config?
>
> Thanks,
>
> Anne

While I applaud your attempt to get us lots of information, unfortunately the one property you ask about (shared_buffers) is the one I can't seem to find. So, maybe you could post a bit more:

1) How many concurrent clients?

2) Can we see an EXPLAIN ANALYZE for a query when it's fast, and then again when it's slow?

3) Is this box dedicated to PG, or are there other services running?

4) It looks like you have 8 GB of RAM, so I assume this is a 64-bit OS. Can you tell us what you have for:

shared_buffers
effective_cache_size
work_mem

5) Once many clients start hitting the db, it might not all fit into RAM and start hitting the HD. Can you tell us what sort of IO you have (SATA, SCSI, RAID, # of disks, etc.)?

The stats from /proc/meminfo:

SwapTotal: 2097112 kB
SwapFree: 2096612 kB

Was this run when the system was busy? Looks like you are not using any swap, so that's good at least. Oh, wait, there are two cat /proc/meminfo's. Is one from when it's fast and one from when it's slow? Looks to me, in both cases, you are not using much memory at all.

(If you happen to have 'free', its output is a little more readable; if you wouldn't mind posting it -- only really needed for when the box is slow.)

-Andy
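
P.S. In case it isn't clear what I'm asking for in (2): take one of the queries that goes slow, prefix it with EXPLAIN ANALYZE, run it once on a quiet system and once during the load test, and post both outputs. A sketch, with placeholder table and column names only, would be:

-- run once while idle and once under load, then post both plans
EXPLAIN ANALYZE
SELECT max(modified_on)        -- placeholder: substitute one of your real MAX / ORDER BY queries
FROM some_table
WHERE some_column = 42;

The difference between the two plans, and between estimated and actual row counts, is usually where the answer is.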
"Anne Rosset" <arosset@collab.net> wrote: > We are running some performances tests. With a lot of concurrent > access, queries get very slow. When there is no load, those > queries run fast. What's "a lot"? > We kind of see a trend about these queries: it seems like the > ones that become very slow have an ORDER BY or MAX in them. Without knowing the PostgreSQL version or more details about the queries, I would only be guessing at the cause. > It seems to me that we should try increasing shared_buffers. But > do you have any other suggestions? Or do you see anything wrong in > our config? I don't think you showed us your whole PostgreSQL configuration, and the format was hard to read -- it's best to show the contents of your postgresql.conf file, minus comments. If you read this page and re-post we can probably be more helpful: http://wiki.postgresql.org/wiki/SlowQueryQuestions -Kevin
When you say that with a lot of concurrent access queries get very slow: how many concurrent connections to your server have you had? More than max_connections' value?

If you want to have many concurrent connections, you should consider using a connection pooling system like pgbouncer or pgpool.

What are the values for:
- work_mem
- shared_buffers
- maintenance_work_mem
- effective_cache_size
- effective_io_concurrency
- server_version

What is your platform?

Regards

--
Ing. Marcos Luís Ortíz Valmaseda
System Engineer -- Database Administrator
Centro de Tecnologías de Gestión de Datos (DATEC)
Universidad de las Ciencias Informáticas
http://postgresql.uci.cu
On 25/01/2011 22:37, Anne Rosset wrote:
> Hi,
>
> We are running some performance tests. With a lot of concurrent
> access, queries get very slow; when there is no load, the same queries
> run fast.

As others said, you need to state how many concurrent clients are working on the database, and also how many logical CPUs (CPU cores, hyperthreading) are present in the machine. As a rule of thumb, once the number of concurrently *active* connections (i.e. doing queries, not idling) exceeds the number of logical CPUs in the machine, you will see a sharp decline in performance.
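To get a feel for that number while the test is running, something like this should do (on older releases pg_stat_activity shows the query text in current_query, and idle backends report '<IDLE>'):

-- count busy vs. idle backends; compare the active count to your number of logical CPUs
SELECT sum(CASE WHEN current_query LIKE '<IDLE>%' THEN 0 ELSE 1 END) AS active,
       sum(CASE WHEN current_query LIKE '<IDLE>%' THEN 1 ELSE 0 END) AS idle
FROM pg_stat_activity;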
Sorry, it seems like the postgres configuration didn't come through the first time.

name | setting
---------------------------------+--------------------------
add_missing_from | off
allow_system_table_mods | off
archive_command | (disabled)
archive_mode | off
archive_timeout | 0
array_nulls | on
authentication_timeout | 1min
autovacuum | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold | 250
autovacuum_freeze_max_age | 200000000
autovacuum_max_workers | 3
autovacuum_naptime | 5min
autovacuum_vacuum_cost_delay | 20ms
autovacuum_vacuum_cost_limit | -1
autovacuum_vacuum_scale_factor | 0.2
autovacuum_vacuum_threshold | 500
backslash_quote | safe_encoding
bgwriter_delay | 200ms
bgwriter_lru_maxpages | 100
bgwriter_lru_multiplier | 2
block_size | 8192
bonjour_name |
check_function_bodies | on
checkpoint_completion_target | 0.5
checkpoint_segments | 3
checkpoint_timeout | 5min
checkpoint_warning | 30s
client_encoding | UTF8
client_min_messages | notice
commit_delay | 250
commit_siblings | 10
constraint_exclusion | off
cpu_index_tuple_cost | 0.005
cpu_operator_cost | 0.0025
cpu_tuple_cost | 0.01
custom_variable_classes |
DateStyle | ISO, MDY
db_user_namespace | off
deadlock_timeout | 1s
debug_assertions | off
debug_pretty_print | off
debug_print_parse | off
debug_print_plan | off
debug_print_rewritten | off
default_statistics_target | 10
default_tablespace |
default_text_search_config | pg_catalog.simple
default_transaction_isolation | read committed
default_transaction_read_only | off
default_with_oids | off
effective_cache_size | 4000000kB
enable_bitmapscan | on
enable_hashagg | on
enable_hashjoin | on
enable_indexscan | on
enable_mergejoin | off
enable_nestloop | on
enable_seqscan | on
enable_sort | on
enable_tidscan | on
escape_string_warning | on
explain_pretty_print | on
extra_float_digits | 0
from_collapse_limit | 8
fsync | on
full_page_writes | on
geqo | off
geqo_effort | 5
geqo_generations | 0
geqo_pool_size | 0
geqo_selection_bias | 2
geqo_threshold | 12
gin_fuzzy_search_limit | 0
ignore_system_indexes | off
integer_datetimes | off
join_collapse_limit | 8
krb_caseins_users | off
krb_server_hostname |
krb_srvname | postgres
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
lc_messages | en_US.UTF-8
lc_monetary | en_US.UTF-8
lc_numeric | en_US.UTF-8
lc_time | en_US.UTF-8
listen_addresses | 127.0.0.1,208.75.198.149
local_preload_libraries |
log_autovacuum_min_duration | -1
log_checkpoints | off
log_connections | off
log_destination | stderr
log_disconnections | off
log_duration | off
log_error_verbosity | default
log_executor_stats | off
log_hostname | off
log_line_prefix |
log_lock_waits | off
log_min_duration_statement | -1
log_min_error_statement | error
log_min_messages | notice
log_parser_stats | off
log_planner_stats | off
log_rotation_age | 0
log_rotation_size | 0
log_statement | none
log_statement_stats | off
log_temp_files | -1
log_timezone | Asia/Kolkata
log_truncate_on_rotation | off
logging_collector | on
maintenance_work_mem | 256MB
max_connections | 100
max_files_per_process | 1000
max_fsm_pages | 500000
max_fsm_relations | 500
max_function_args | 100
max_identifier_length | 63
max_index_keys | 32
max_locks_per_transaction | 64
max_prepared_transactions | 5
max_stack_depth | 5MB
password_encryption | on
port | 5432
post_auth_delay | 0
pre_auth_delay | 0
random_page_cost | 4
regex_flavor | advanced
search_path | "$user",public
seq_page_cost | 1
server_encoding | UTF8
server_version | 8.3.8
server_version_num | 80308
session_replication_role | origin
shared_buffers | 240MB
silent_mode | off
sql_inheritance | on
ssl | off
standard_conforming_strings | off
statement_timeout | 0
superuser_reserved_connections | 3
synchronize_seqscans | on
synchronous_commit | on
syslog_facility | LOCAL0
syslog_ident | postgres
tcp_keepalives_count | 9
tcp_keepalives_idle | 7200
tcp_keepalives_interval | 75
temp_buffers | 1024
temp_tablespaces |
TimeZone | Asia/Kolkata
timezone_abbreviations | Default
trace_notify | off
trace_sort | off
track_activities | on
track_counts | on
transaction_isolation | read committed
transaction_read_only | off
transform_null_equals | off
unix_socket_group |
unix_socket_permissions | 511
update_process_title | on
vacuum_cost_delay | 50ms
vacuum_cost_limit | 200
vacuum_cost_page_dirty | 20
vacuum_cost_page_hit | 1
vacuum_cost_page_miss | 10
vacuum_freeze_min_age | 100000000
wal_buffers | 10MB
wal_sync_method | fdatasync
wal_writer_delay | 200ms
work_mem | 64MB
xmlbinary | base64
xmloption | content
zero_damaged_pages | off
(176 rows)

Today we did more analysis and observed postgres processes that continually reported status 'D' in top. The corresponding vmstat showed a proportionate number of processes under the 'b' column, the "uninterruptible" state. We've been able to match long-running database queries to such processes. This occurs under a relatively low load average (say 4 out of 8) and can involve as little as a single SQL query.

It seems that many queries get into that state, and that is causing our load average to spike very high. Queries are finishing even though we continue to see an increase in postgres processes in 'D' state. Are we facing some serious db locking? What could lead to this? (The box has 8G and 8 cores.)

Thanks for any help,
Anne

-----Original Message-----
From: Andy Colson [mailto:andy@squeakycode.net]
Sent: Tuesday, January 25, 2011 2:13 PM
To: Anne Rosset
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Queries becoming slow under heavy load

<SNIP: full quote of Andy's message above>
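
P.S. To rule db locking in or out: would a query like this, run while the processes are stuck in 'D', be the right thing to capture? (Joining on the backend pid, with the 8.3 column names:)

-- backends whose lock requests have not been granted, and what they are running
SELECT a.procpid, a.usename, a.waiting, l.locktype, l.mode, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;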
On 01/26/2011 10:04 AM, Anne Rosset wrote:

> We've been able to match long-running database queries to such
> processes. This occurs under a relatively low load average (say 4 out
> of 8) and can involve as little as a single SQL query.

The 'b' state means the process is blocked, waiting for... something. One thing you need to consider is far more than your CPU usage. If you have the 'sar' utility, run it as 'sar 1 100' just to see how your system is working.

What you want to watch for is iowait. If even one query is churning your disks, every single other query that has to take even one block from disk instead of cache is going to stall. If you see an iowait of anything greater than 5%, you'll want to check further on the device that contains your database with iostat. My favorite use of this is 'iostat -dmx [device] 1' where [device] is the block device where your data files are, if your WAL is somewhere else.

And yeah, your shared_buffers are kinda on the lowish side. Your effective_cache_size is good, but you have a lot more room to increase PG-specific memory.

Worse, however, is your checkpoints. Lord. Increase checkpoint_segments to *at least* 20, and increase your checkpoint_completion_target to 0.7 or 0.8. Check your logs for checkpoint warnings, and I'll bet it's constantly complaining about increasing your checkpoint segments. Every checkpoint not started on schedule risks a checkpoint spike, which can flood your system with IO regardless of which queries are running. That kind of IO storm will ruin your performance, and with only 3 checkpoint segments on a busy database, they are probably happening constantly.

Unfortunately we still need to know more. This is just based on your PG settings, and that's not really enough to know how "busy" your DB is. One way to check is to log the contents of pg_stat_database, especially the xact_commit and xact_rollback columns. Grab those with a timestamp. If you get a snapshot of that every minute, you can figure out how many queries you're processing per minute or per second pretty easily. We've hit 8600 TPS before and don't have nearly the trouble you've been reporting.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email
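
P.S. The pg_stat_database sampling doesn't need to be fancy. Something like this once a minute from cron, appended to a file, is enough; TPS is just the difference between consecutive xact_commit values divided by the sampling interval:

-- one timestamped sample of commit/rollback counters for the current database
SELECT now() AS sample_time, datname, xact_commit, xact_rollback
FROM pg_stat_database
WHERE datname = current_database();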
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Ivan Voras
> Sent: Wednesday, January 26, 2011 6:25 AM
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Queries becoming slow under heavy load
>
> <SNIP>

Depending on what version the OP is running -- I didn't see where a version was given -- a "lot" of idle connections can affect things as well. Tom indicated to me this should be "much better" in 8.4 and later.

<slight deviation to put idle connection overhead in perspective>

We cut our idle connections from 600+ to a bit over 300 and saw a good drop in box load and query responsiveness. (We still get large user CPU load spikes, though, when a few hundred idle connection processes are woken up, because they all appear to be sleeping on the same semaphore and one of them has work to do.)

(Yeah, I know, get a pooler; too bad only pgbouncer seems to pool out idle connections with transaction pooling, but then I lose prepared statements... I am still working on that part, and on getting off 8.3. Yes, our app tried to do its own quasi connection pooling. Now that we have deployed the app on a few hundred boxes, the error of that choice, made years ago when this app lived on much fewer machines, is pretty evident.)
On Wed, Jan 26, 2011 at 9:04 AM, Anne Rosset <arosset@collab.net> wrote:

<HUGE LIST OF SETTINGS DELETED>

PLEASE post just the settings you changed. I'm not searching through a list that big for the interesting bits.

> Today we did more analysis and observed postgres processes that
> continually reported status 'D' in top.

Full stop. The most likely problem here is that the query is now hitting the disks and waiting. If you have 1 disk and two users, the access speed will drop by factors, usually much higher than 2.

To put it very simply, you need as many mirror pairs in your RAID-10, or as many disks in your RAID-5 or RAID-6, as you have users reading the disk drives. If you're writing, you need more and more disks too.

Mitigating this issue we find things like SSD cache in ZFS or battery-backed RAID controllers. They allow the reads and writes to be streamlined quite a bit to the spinning disks, making it appear that the RAID array underneath was much faster, had better access, and all the sectors were near each other. To an extent.

If you have the answer to the previous poster's question, "can you tell us what sort of IO you have (sata, scsi, raid, # of disks, etc)," you should provide it. If you've got a pair of 5k RPM SATA drives in a RAID-1, you might need more hardware.

So, instead of just settings, show us a few carefully selected lines of output from vmstat or iostat while this is happening. Don't tell us what you see, show us.
On Wed, Jan 26, 2011 at 10:16 AM, Shaun Thomas <sthomas@peak6.com> wrote:

> Worse, however, is your checkpoints. Lord. Increase checkpoint_segments
> to *at least* 20, and increase your checkpoint_completion_target to 0.7
> or 0.8. Check your logs for checkpoint warnings, and I'll bet it's
> constantly complaining about increasing your checkpoint segments. Every
> checkpoint not started on schedule risks a checkpoint spike, which can
> flood your system with IO regardless of which queries are running. That
> kind of IO storm will ruin your performance, and with only 3 checkpoint
> segments on a busy database, they are probably happening constantly.

To Shaun: Unless she's not write bound but read bound. We can't tell, because we haven't seen the queries. We haven't seen the output of iostat or vmstat.

To Anne: Another tool to recommend is sar. It's part of the sysstat package on debian / ubuntu / rhel; you have to enable it in various ways, and it'll tell you how when you try to run it after installing it. It allows you to look back over the last 7 days, 5 minutes at a time, to see the trends on your servers. Very useful stuff and easy to graph in a spreadsheet or web page. Or just read it.

For instance, here's the output of sar on the data drive of a slave (read-only) server under slony replication:

sar -d -f sa25|grep "02:[01234].:.. AM.\+dev251-1"
Linux 2.6.32-27-server () 01/25/2011 _x86_64_ (16 CPU)

12:00:01 AM       DEV      tps   rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz   await  svctm  %util
09:45:01 AM  dev251-1   481.21    6981.74   1745.82     18.14      4.86   10.10   1.57  75.65
09:55:01 AM  dev251-1   620.16   28539.52   2135.67     49.46      5.25    8.47   1.22  75.42
10:05:01 AM  dev251-1  1497.16   29283.95   1898.94     20.83     13.89    9.28   0.64  96.52
10:15:01 AM  dev251-1  1040.47   17123.89   2286.10     18.66      8.89    8.54   0.87  90.49
10:25:01 AM  dev251-1   562.97    8802.77   1515.50     18.33      4.84    8.60   1.41  79.57

Let me interpret for ya, in case it's not obvious. IO utilization runs from about 50% to about 90%. When it's at 90% we are running 700 to 1000 tps, reading at a maximum of 15MB a second and writing at a paltry 1MB or so a second. Average wait stays around 10ms.

If we use sar -u from the same time period, we can match up iowait to this chart and see if we were really waiting on IO or not:

12:00:01 AM  CPU   %user  %nice  %system  %iowait  %steal  %idle
09:45:01 AM  all   47.44   0.00     5.20     4.94    0.00  42.42
09:55:01 AM  all   46.42   0.00     5.63     5.77    0.00  42.18
10:05:01 AM  all   48.64   0.00     6.35    11.87    0.00  33.15
10:15:01 AM  all   46.94   0.00     5.79     8.81    0.00  38.46
10:25:01 AM  all   48.68   0.00     5.58     5.42    0.00  40.32

We can see that at peak, 11% of our CPU power is waiting behind IO. We have 16 CPUs, so each one is 100/16, or 6.25%, of the total. So at 11% we have two cores on hold the whole time, basically. In real life on this machine we have ALL cpus waiting about 11% of the time across the board. But the math comes out the same. We're waiting on IO.

Here's a heavy db server, lots of ram, same time period. sdh is one of a large number of disks in a RAID-10 array. md17 is that RAID-10 array (actually the RAID-0 at the top of a bunch of RAID-1s; I still don't trust linux's RAID-10 implementation).
12:00:01 AM   DEV      tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
09:45:01      sdh     5.86      5.65    158.87     28.08      0.21  35.07   3.36   1.97
09:45:01     md17   253.78    168.69   1987.64      8.50      0.00   0.00   0.00   0.00
09:55:01      sdh     5.48      5.65    134.99     25.68      0.16  30.00   3.31   1.81
09:55:01     md17   215.13    157.56   1679.67      8.54      0.00   0.00   0.00   0.00
10:05:01      sdh     4.37      5.39    106.53     25.58      0.09  21.61   3.57   1.56
10:05:01     md17   170.81    133.76   1334.43      8.60      0.00   0.00   0.00   0.00
10:15:01      sdh     6.16      5.37    177.25     29.64      0.25  40.95   3.38   2.08
10:15:01     md17   280.63    137.88   2206.95      8.36      0.00   0.00   0.00   0.00
10:25:01      sdh     4.52      3.72    116.41     26.56      0.09  20.64   3.58   1.62
10:25:01     md17   187.65    107.59   1470.88      8.41      0.00   0.00   0.00   0.00

(Note that md devices do not show values for %util, svctm or await, hence the need for sdh.)

This db fits the data set in ram; the other machine doesn't. It had a RAID controller, but that caught fire, and burned down. The next caught fire, burned down, and fell into the swamp. It now has a straight up SAS controller with no caching. Numbers were even better when it had a caching RAID controller, but I got tired of replacing them.

OK, so md17 is handling only 280 tps, while the array on the other, smaller server was around 1,000. The master is reading 5 or 6 sectors per second, while the slave is reading 30k sectors a second. The master is writing at ~1500 to 2000 sectors a second; the slave is similar.

The slave server here was IO bound very badly because it A: didn't have enough memory to cache the data set, and B: hadn't had time to warm up to get what memory it had to do the job. It was thrown into the mix mid morning rush and it fell flat on its ass. If it had been warmed up first (running it at a 100:1 load factor by our load balancing module to start) it would have been ok. It would have still had horrible IO numbers, though. Once the caches load, by 1500hrs, the slave is reading at just 500 sectors / sec, child's play really.

So, get sar running, and get some numbers from the machine when these things are happening. Look at how it looks before, during, and after the crisis.
Scott,
Thanks for your response. We are over NFS for our storage ...

Here is what we see during our performance testing. This is about 7 seconds after the query was sent to postgres:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 89.9  5.0  3868:44  java
 1846 postgres  16   0  474m 198m 103m R 75.2  2.5   0:28.69 postmaster
 2170 postgres  15   0  391m 203m 188m R 44.0  2.6   0:17.63 postmaster
 2555 httpd     18   0  298m  15m 4808 R 22.0  0.2   0:00.12 httpd
 2558 root      15   0 29056 2324 1424 R  1.8  0.0   0:00.01 top
 1207 httpd     15   0  337m  20m 7064 R  0.0  0.3   0:00.69 httpd
28312 postgres  16   0  396m 183m 162m D  0.0  2.3   0:50.82 postmaster  <---- this is the query here

Notice the 0% CPU; also, notice the 183m RES memory.

Ten seconds later:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 92.9  5.0  3868:53  java
 2657 root      15   0 29056 2328 1424 R  1.9  0.0   0:00.01 top
28312 postgres  16   0  396m 184m 162m D  0.0  2.3   0:50.84 postmaster  <---- here

Ten seconds after that:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 88.7  5.0  3869:02  java
 1845 postgres  16   0  473m 223m 127m D 22.6  2.8   0:26.39 postmaster
 2412 httpd     15   0 2245m 1.4g  16m R 18.9 17.8   0:02.48 java
  966 postgres  15   0  395m 242m 221m D  0.0  3.0   1:02.31 postmaster
 2680 root      15   0 29056 2336 1424 R  0.0  0.0   0:00.01 top
28312 postgres  16   0  396m 184m 163m D  0.0  2.3   0:50.85 postmaster  <--- here

etc. ... and it's not until around the 221 second mark that we catch it consuming CPU:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 93.4  5.0  3872:07  java
28312 postgres  16   0  396m 225m 204m R  5.7  2.8   0:51.52 postmaster  <----- here
 3391 root      15   0 29056 2348 1424 R  1.9  0.0   0:00.01 top
 4297 root      16   0 10228  740  632 D  0.0  0.0  12:53.66 hald-addon-stor
26885 httpd     15   0 2263m 1.5g  16m R  0.0 19.0   0:00.01 java

Note that the load average is fine during this timeframe, ~4 out of 8, so there is plenty of CPU. It looks like true "halting".

Further, or worse yet, this same behavior expands out to multiple processes, producing a true "back up". It can look something like this; notice the 0% CPU consumption:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      22   0  689m 399m  10m R 91.1  5.0  3874:32  java
 4139 root      15   0 29080 2344 1424 R  1.9  0.0   0:00.01 top
 1555 postgres  16   0  474m 258m 162m D  0.0  3.2   0:17.32 postmaster
 1846 postgres  16   0  474m 285m 189m D  0.0  3.6   0:47.43 postmaster
 2713 postgres  16   0  404m 202m 179m D  0.0  2.5   0:33.54 postmaster
 2801 postgres  16   0  391m 146m 131m D  0.0  1.8   0:04.48 postmaster
 2804 postgres  16   0  419m 172m 133m D  0.0  2.2   0:09.41 postmaster
 2825 postgres  16   0  473m 142m  49m D  0.0  1.8   0:04.12 postmaster

Thanks for any additional explanation/advice,
Anne

-----Original Message-----
From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
Sent: Wednesday, January 26, 2011 8:19 PM
To: Anne Rosset
Cc: pgsql-performance@postgresql.org
Subject: Re: FW: [PERFORM] Queries becoming slow under heavy load

<SNIP: full quote of Scott's message above>
On 01/27/2011 11:12 PM, Anne Rosset wrote:

> Thanks for your response.
> We are over NFS for our storage ...

NFS? I'm not sure you know this, but NFS has major locking issues that would make it a terrible candidate for hosting a database.

> and it's not until around the 221 second mark that we catch it consuming CPU:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  7090 root      25   0  689m 399m  10m R 93.4  5.0  3872:07  java
> 28312 postgres  16   0  396m 225m 204m R  5.7  2.8   0:51.52 postmaster  <----- here
>  3391 root      15   0 29056 2348 1424 R  1.9  0.0   0:00.01 top
>  4297 root      16   0 10228  740  632 D  0.0  0.0  12:53.66 hald-addon-stor
> 26885 httpd     15   0 2263m 1.5g  16m R  0.0 19.0   0:00.01 java
>
> Note that the load average is fine during this timeframe, ~4 out of 8, so there is plenty of CPU.

Please listen to us. We asked you to use sar, or iostat, to tell us how much the disk IO is being utilized. From your other screenshots, there were at least two other PG processes that were running and could have been thrashing the disk or locking tables your "slow" query needed. If it's waiting for disk IO, the CPU will remain low until it gets what it needs.

Not everything is about the CPU. Especially now that we know your DB is running on top of NFS.

> Further, or worse yet, this same behavior expands out to multiple processes,
> producing a true "back up". It can look
> something like this; notice the 0% CPU consumption:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  7090 root      22   0  689m 399m  10m R 91.1  5.0  3874:32  java
>  4139 root      15   0 29080 2344 1424 R  1.9  0.0   0:00.01 top
>  1555 postgres  16   0  474m 258m 162m D  0.0  3.2   0:17.32 postmaster
>  1846 postgres  16   0  474m 285m 189m D  0.0  3.6   0:47.43 postmaster
>  2713 postgres  16   0  404m 202m 179m D  0.0  2.5   0:33.54 postmaster
>  2801 postgres  16   0  391m 146m 131m D  0.0  1.8   0:04.48 postmaster
>  2804 postgres  16   0  419m 172m 133m D  0.0  2.2   0:09.41 postmaster
>  2825 postgres  16   0  473m 142m  49m D  0.0  1.8   0:04.12 postmaster

Yes. And they could all be waiting for IO. Or NFS locking is blocking the reads. Or... what is that Java app doing? We don't know the state of your IO, and when you have 0% or very low CPU usage, you either have locking contention or you're being IO starved.

And what queries are these connections performing? You can check it by getting the contents of the pg_stat_activity system view. If they're selecting and still "slow", compare that against the iostat or sar results. For instance, here's an iostat of our system:

iostat -dmx dm-9 1

Linux 2.6.18-92.el5 (oslchi6pedb1)    01/28/2011

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
dm-9       0.00    0.00  125.46   227.78   4.95   0.89     33.88      0.08   0.19   0.08   2.91

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
dm-9       0.00    0.00    5.00     0.00   0.04   0.00     14.40      0.05  10.60  10.60   5.30

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
dm-9       0.00    0.00    2.00     0.00   0.02   0.00     16.00      0.01   7.00   7.00   1.40

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
dm-9       0.00    0.00    4.00  1184.00   0.04   4.62      8.04     27.23  11.73   0.06   6.80

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
dm-9       0.00    0.00   11.00   847.00   0.09   3.31      8.10     29.31  49.65   0.79  67.90

That last column, %util, effectively tells us how saturated the controller is. If the percentage is high, it's really working hard to supply the data we're asking for, or trying to write. If it's low, we're probably working from memory cache, or getting fewer requests. There have been times our queries are "slow", and when we check this stat it's often at or above 90%, sometimes for minutes at a time. That's almost always a clear indicator that you have IO contention. Queries can't work without the data they need to return your results.

Sending us more CPU charts isn't going to help us in helping you.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email
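
P.S. Pulling pg_stat_activity can be as simple as this (8.3 column names); run it a few times while the backends are sitting in 'D' state and post it along with the iostat/sar numbers:

-- what each backend is running, how long it has been running, and whether it is waiting on a lock
SELECT procpid, usename, waiting, now() - query_start AS runtime, current_query
FROM pg_stat_activity
ORDER BY query_start;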
Shaun Thomas wrote:
> On 01/27/2011 11:12 PM, Anne Rosset wrote:
>
>> Thanks for your response.
>> We are over NFS for our storage ...
>
> NFS? I'm not sure you know this, but NFS has major locking issues
> that would make it a terrible candidate for hosting a database.

That depends on the implementation. Vendor-supported NAS, running NFS3 or NFS4, should be OK. There are other databases that can use it, too. Some databases even have a built-in NFS client.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions
Thanks to all of you who replied and pointed to NFS as a potential culprit. Our issue was that pgsql's temp dir (pgsql_tmp) was set to the default value ($PSQL_DIR/base/pgsql_tmp/), which was located on NFS. Moving the temp dir to local disk got us a huge improvement.

Anne

-----Original Message-----
From: Shaun Thomas [mailto:sthomas@peak6.com]
Sent: Friday, January 28, 2011 7:31 AM
To: Anne Rosset
Cc: pgsql-performance@postgresql.org
Subject: Re: FW: [PERFORM] Queries becoming slow under heavy load

<SNIP: full quote of Shaun's message above>
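
P.S. For anyone who runs into the same thing: one way to do this is to create a tablespace on local disk and point temp_tablespaces at it (the path and tablespace name below are only examples); log_temp_files is also handy for spotting which queries spill to disk in the first place:

-- as a superuser, with /local/disk/pgsql_temp already created and owned by the postgres user
CREATE TABLESPACE local_temp LOCATION '/local/disk/pgsql_temp';

-- then in postgresql.conf (followed by a reload):
--   temp_tablespaces = 'local_temp'
--   log_temp_files   = 10240    # log every temp file larger than 10MB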
Excellent! And you learned a bit more about how to monitor your server while you were at it. Win win!

On Thu, Feb 3, 2011 at 10:40 AM, Anne Rosset <arosset@collab.net> wrote:
> Thanks to all of you who replied and pointed to NFS as a potential
> culprit. Our issue was that pgsql's temp dir (pgsql_tmp) was set to the
> default value ($PSQL_DIR/base/pgsql_tmp/), which was located on NFS.
> Moving the temp dir to local disk got us a huge improvement.
>
> Anne
>
> <SNIP>

--
To understand recursion, one must first understand recursion.