Query plan looks OK, but slow I/O - settings advice? - Mailing list pgsql-performance

From Roger Hand
Subject Query plan looks OK, but slow I/O - settings advice?
Date
Msg-id DB28E9B548192448A4E8C8A3C1B1E475611C66@sj1-exch-01.us.corp.kailea.com
Responses Re: Query plan looks OK, but slow I/O - settings advice?  ("Jeffrey W. Baker" <jwbaker@acm.org>)
Re: Query plan looks OK, but slow I/O - settings advice?  (Gavin Sherry <swm@alcove.com.au>)
List pgsql-performance
Summary
=======
We are writing to the db pretty much 24 hours a day.
Recently the amount of data we write has increased, and the query speed, formerly okay, has taken a dive.
The query is using the indexes as expected, so I don't _think_ I have a query tuning issue, just an I/O problem.
The first time a query is done it takes about 60 seconds. The second time it runs in about 6 seconds.
What I know I need advice on is I/O settings and various buffer settings.
I may also need advice on other things, but just don't know it yet!

Below is ...
- an explain analyze
- details of the db setup and hardware
- some vmstat and iostat output showing the disks are very busy
- the SHOW ALL output for the db config.

Details
=======
Postgres 8.0.3

Below is a sample query. (This is actually implemented as a prepared statement. Here I fill in the '?'s with actual
values.)
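
For reference, the prepared form is roughly as follows. (This is only a sketch: the real statement is generated by our application, so the statement name and parameter numbering here are illustrative, not the actual code.)

-- Illustrative sketch of the prepared statement; the name and parameter
-- order are assumptions. The EXECUTE values match the example query below.
PREPARE recent_logdata (int, int, int, int, int, int) AS
    SELECT datavalue, logfielddatatype, timestamp
    FROM logdata_recent
    WHERE (logfielddatatype = $1 OR logfielddatatype = $2 OR logfielddatatype = $3)
      AND graphtargetlog = $4
      AND timestamp >= $5 AND timestamp <= $6
    ORDER BY timestamp;

EXECUTE recent_logdata (70, 71, 69, 1327, 1123052400, 1123138800);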

electric=# EXPLAIN ANALYZE
electric-# SELECT datavalue, logfielddatatype, timestamp FROM logdata_recent
electric-# WHERE (logfielddatatype = 70 OR logfielddatatype = 71 OR logfielddatatype = 69)
electric-# AND graphtargetlog = 1327
electric-# AND timestamp >= 1123052400 AND timestamp <= 1123138800
electric-# ORDER BY timestamp;

                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=82.48..82.50 rows=6 width=14) (actual time=60208.968..60211.232 rows=2625 loops=1)
   Sort Key: public.logdata_recent."timestamp"
   ->  Result  (cost=0.00..82.41 rows=6 width=14) (actual time=52.483..60200.868 rows=2625 loops=1)
         ->  Append  (cost=0.00..82.41 rows=6 width=14) (actual time=52.476..60189.929 rows=2625 loops=1)
               ->  Seq Scan on logdata_recent  (cost=0.00..46.25 rows=1 width=14) (actual time=0.003..0.003 rows=0 loops=1)
                     Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
               ->  Index Scan using logdata_recent_1123085306_ix_t_fld_gtl, logdata_recent_1123085306_ix_t_fld_gtl, logdata_recent_1123085306_ix_t_fld_gtl on logdata_recent_stale logdata_recent  (cost=0.00..18.08 rows=3 width=14) (actual time=52.465..60181.624 rows=2625 loops=1)
                     Index Cond: ((("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 70) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 71) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 69) AND (graphtargetlog = 1327)))
                     Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
               ->  Index Scan using logdata_recent_1123139634_ix_t_fld_gtl, logdata_recent_1123139634_ix_t_fld_gtl, logdata_recent_1123139634_ix_t_fld_gtl on logdata_recent_active logdata_recent  (cost=0.00..18.08 rows=2 width=14) (actual time=0.178..0.178 rows=0 loops=1)
                     Index Cond: ((("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 70) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 71) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 69) AND (graphtargetlog = 1327)))
                     Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
 Total runtime: 60214.545 ms
(13 rows)

60 seconds is much longer than it used to be. I would guess it used to be under 10 seconds. The second time the above query is run we see the magic of caching as the time goes down to 6 seconds.

logdata_recent_active and logdata_recent_stale are inherited tables of logdata_recent, which never has any data. (This is pseudo-partitioning in action!)
So the very quick seq_scan on the empty logdata_recent parent table is okay with me.
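
For anyone who hasn't seen this trick, the layout is roughly as follows (a simplified sketch; column types and any extra columns are approximations, not our exact DDL):

-- Simplified sketch of the pseudo-partitioning layout; column types
-- and the full column list are approximations, not the exact DDL.
CREATE TABLE logdata_recent (
    "timestamp"      integer,   -- epoch seconds
    logfielddatatype integer,
    graphtargetlog   integer,
    datavalue        real
);

-- The children hold all the rows; the parent stays empty.
CREATE TABLE logdata_recent_active () INHERITS (logdata_recent);
CREATE TABLE logdata_recent_stale  () INHERITS (logdata_recent);

Queries go through the parent, so the planner appends a scan of the (empty) parent plus one scan per child, which is what shows up in the plan above.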

The index is built on timestamp, logfielddatatype, graphtargetlog. I am curious as to why the same index shows up 3 times in the "using" clause, but can live without knowing the details as long as it doesn't indicate that something's wrong.
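
For completeness, here are the per-child index definitions, reconstructed from the plan output and the column list above (so treat the exact form as approximate):

-- Reconstructed from the plan output; the exact DDL may differ slightly.
CREATE INDEX logdata_recent_1123085306_ix_t_fld_gtl
    ON logdata_recent_stale ("timestamp", logfielddatatype, graphtargetlog);

CREATE INDEX logdata_recent_1123139634_ix_t_fld_gtl
    ON logdata_recent_active ("timestamp", logfielddatatype, graphtargetlog);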

The logdata_recent_stale table has 5 million rows. The size of the table itself, on disk, is 324MB. The size of the index is 210MB.
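
A rough cross-check of those sizes from inside the database (using the 8KB block size shown in the config below; relpages is only as fresh as the last VACUUM/ANALYZE) would be something like:

-- Rough size estimate from the catalogs; relpages is only updated by
-- VACUUM/ANALYZE, so this approximates the on-disk size.
SELECT relname,
       relpages,
       relpages * 8192 / (1024 * 1024) AS approx_mb
FROM pg_class
WHERE relname IN ('logdata_recent_stale',
                  'logdata_recent_1123085306_ix_t_fld_gtl');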

The disks are ext3 with a journalling mode of ordered (data=ordered), but this was later changed to writeback (data=writeback) with no apparent change in speed.

They're on a Dell PowerEdge 6650 with an LSI RAID card, set up as follows:
4 disks RAID 10 for indexes (145GB) - sdc1
6 disks RAID 10 for data (220GB) - sdd1
2 mirrored disks for logs - sdb1

stripe size is 32k
cache policy: cached io (I am told the controller has a BBU)
write policy: write-back
read policy: readahead

The partition names do what they say ...
[root@rage-db2 /dbdata01]$ df
Filesystem   1K-blocks      Used Available Use% Mounted on
/dev/sdb1     70430588    729324  66123592   2% /dblog01
/dev/sdc1    140861236  19472588 114233300  15% /dbindex01
/dev/sdd1    211299960 157159988  43406548  79% /dbdata01
...

Using iostat (the version from http://linux.inet.hr/) I saw at one point that the data disk was 100% busy.
I believe this was when running the above query, or similar, but in any case the system is always busy with both reads and (usually) writes.

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0    61    0.1   15.5     0.4   305.7   19.6   0.1    5.8   4.9   8
sdc1      21    22   20.6   17.7   164.6   158.6    8.4   1.1   28.3   6.2  24
sdd1    1742    11 1904.7    6.6 14585.6    71.5    7.7  20.6   10.8   0.5 100

Another time, when I was running the query above, the index partition went to 90+% busy for 40 seconds:

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.2     0.0     0.8    4.0   0.0   20.0  15.0   0
sdc1     366    53  687.0   66.1  4213.1   483.0    6.2  11.8   15.7   1.3  96
sdd1       8    17   16.6   13.9    99.5   125.4    7.4   0.7   23.0   1.9   6

On another occasion (when the query took 24 seconds) I ran vmstat and iostat every 5 seconds,
from just before the query until just after. Roughly the first two outputs are from before the query started.
In this case the index disk is maxed.

[root@rage-db2 ~]$ vmstat 5 16
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy wa id
 0  0     92 1233500 225692 9578564    0    0     0     0    1     1  2  2  1  1
 0  1     92 1218460 225748 9595136    0    0  3322    18  655   898  0  0 20 79
 0  1     92 1202124 225780 9616140    0    0  4204    58  920  1291  0  1 24 76
 0  1     92 1172876 225820 9645348    0    0  5847   120 1053  1482  0  1 23 76
 1  0     92 1151712 225836 9666504    0    0  4234     7  847  1239  2  1 18 78
 1  0     92 1140860 225844 9677436    0    0  2153   500  575  2027 13  2 11 73
 1  0     92 1140852 225848 9677636    0    0     0   506  213   442 10  1  0 89

[root@rage-db2 ~]$ /usr/local/bin/iostat -Px 5 16

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       1   243    0.1  105.5     2.7    37.6    0.4   0.0    0.2   0.1   1
sdc1       6   111    3.7   75.9    38.3   769.3   10.1   0.0    0.3   0.2   1
sdd1     255   107   85.2   37.6     4.9   581.0    4.8   0.0    0.1   0.0   1

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sdc1     273     0  414.0    0.4  2747.7     2.4    6.6   1.6    3.9   1.7  69
sdd1       0     1    1.4    0.4     7.2     5.6    7.1   0.0   10.0   6.7   1

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.4     0.0     1.6    4.0   0.0   10.0   5.0   0
sdc1     225     4  777.1    4.6  4011.0    35.1    5.2   2.5    3.2   1.3  99
sdd1       0     2    0.0    2.6     0.0    16.8    6.5   0.0    8.5   0.8   0

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sdc1     508     7  917.8    7.4  5703.0    58.3    6.2   2.2    2.4   1.1  98
sdd1       0     4    0.0    6.8     0.0    44.7    6.6   0.1   15.6   0.6   0

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sdc1     361     0  737.5    0.4  4391.7     2.4    6.0   1.8    2.4   1.0  76
sdd1       0     0    0.0    0.4     0.0     2.4    6.0   0.0    0.0   0.0   0

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0    87    0.0   17.8     0.0   418.3   23.6   0.0    1.3   1.2   2
sdc1     216     2  489.5    0.4  2821.7    11.2    5.8   1.2    2.4   1.1  56
sdd1       2     4    7.2    0.6    37.5    18.4    7.2   0.0    6.2   3.3   3

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0    89    0.0   22.4     0.0   446.3   20.0   0.0    1.1   0.8   2
sdc1       0     4    0.0    1.0     0.0    18.4   18.4   0.0    0.0   0.0   0
sdd1       0     6    0.0    1.0     0.0    27.1   27.2   0.0    0.0   0.0   0

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0    89    0.0   22.5     0.0   446.2   19.8   0.0    0.4   0.3   1
sdc1       0     2    0.0    0.4     0.0     9.6   24.0   0.0    0.0   0.0   0
sdd1       0     4    0.0    0.6     0.0    20.0   33.3   0.0    0.0   0.0   0


Finally, here's the output of SHOW ALL:

add_missing_from              | on
archive_command               | unset
australian_timezones          | off
authentication_timeout        | 60
bgwriter_delay                | 200
bgwriter_maxpages             | 100
bgwriter_percent              | 1
block_size                    | 8192
check_function_bodies         | on
checkpoint_segments           | 20
checkpoint_timeout            | 300
checkpoint_warning            | 30
client_encoding               | UNICODE
client_min_messages           | notice
commit_delay                  | 350
commit_siblings               | 5
config_file                   | /dbdata01/pgdata/postgresql.conf
cpu_index_tuple_cost          | 0.001
cpu_operator_cost             | 0.0025
cpu_tuple_cost                | 0.01
custom_variable_classes       | unset
data_directory                | /dbdata01/pgdata
DateStyle                     | ISO, MDY
db_user_namespace             | off
deadlock_timeout              | 1000
debug_pretty_print            | off
debug_print_parse             | off
debug_print_plan              | off
debug_print_rewritten         | off
debug_shared_buffers          | 0
default_statistics_target     | 50
default_tablespace            | unset
default_transaction_isolation | read committed
default_transaction_read_only | off
default_with_oids             | off
dynamic_library_path          | $libdir
effective_cache_size          | 48000
enable_hashagg                | on
enable_hashjoin               | on
enable_indexscan              | on
enable_mergejoin              | on
enable_nestloop               | on
enable_seqscan                | on
enable_sort                   | on
enable_tidscan                | on
explain_pretty_print          | on
external_pid_file             | unset
extra_float_digits            | 0
from_collapse_limit           | 8
fsync                         | on
geqo                          | on
geqo_effort                   | 5
geqo_generations              | 0
geqo_pool_size                | 0
geqo_selection_bias           | 2
geqo_threshold                | 12
hba_file                      | /dbdata01/pgdata/pg_hba.conf
ident_file                    | /dbdata01/pgdata/pg_ident.conf
integer_datetimes             | off
join_collapse_limit           | 8
krb_server_keyfile            | unset
lc_collate                    | en_US.UTF-8
lc_ctype                      | en_US.UTF-8
lc_messages                   | en_US.UTF-8
lc_monetary                   | en_US.UTF-8
lc_numeric                    | en_US.UTF-8
lc_time                       | en_US.UTF-8
listen_addresses              | *
log_connections               | off
log_destination               | stderr
log_directory                 | /dblog01
log_disconnections            | off
log_duration                  | off
log_error_verbosity           | default
log_executor_stats            | off
log_filename                  | postgresql-%Y-%m-%d_%H%M%S.log
log_hostname                  | off
log_line_prefix               | unset
log_min_duration_statement    | -1
log_min_error_statement       | panic
log_min_messages              | notice
log_parser_stats              | off
log_planner_stats             | off
log_rotation_age              | 1440
log_rotation_size             | 10240
log_statement                 | none
log_statement_stats           | off
log_truncate_on_rotation      | off
maintenance_work_mem          | 262144
max_connections               | 40
max_files_per_process         | 1000
max_fsm_pages                 | 100000
max_fsm_relations             | 1000
max_function_args             | 32
max_identifier_length         | 63
max_index_keys                | 32
max_locks_per_transaction     | 64
max_stack_depth               | 2048
password_encryption           | on
port                          | 5432
pre_auth_delay                | 0
preload_libraries             | unset
random_page_cost              | 4
redirect_stderr               | on
regex_flavor                  | advanced
rendezvous_name               | unset
search_path                   | $user,public
server_encoding               | UNICODE
server_version                | 8.0.3
shared_buffers                | 10240
silent_mode                   | off
sql_inheritance               | on
ssl                           | off
statement_timeout             | 0
stats_block_level             | off
stats_command_string          | off
stats_reset_on_server_start   | on
stats_row_level               | on
stats_start_collector         | on
superuser_reserved_connections | 2
syslog_facility               | LOCAL0
syslog_ident                  | postgres
TimeZone                      | PST8PDT
trace_notify                  | off
transaction_isolation         | read committed
transaction_read_only         | off
transform_null_equals         | off
unix_socket_directory         | unset
unix_socket_group             | unset
unix_socket_permissions       | 511
vacuum_cost_delay             | 180
vacuum_cost_limit             | 200
vacuum_cost_page_dirty        | 20
vacuum_cost_page_hit          | 1
vacuum_cost_page_miss         | 10
wal_buffers                   | 8
wal_sync_method               | fdatasync
work_mem                      | 98304
zero_damaged_pages            | off
