Thread: Query plan looks OK, but slow I/O - settings advice?

From: "Roger Hand"

Summary
=======
We are writing to the db pretty much 24 hours a day.
Recently the amount of data we write has increased, and the query speed, formerly okay, has taken a dive.
The query is using the indexes as expected, so I don't _think_ I have a query tuning issue, just an io problem.
The first time a query is done it takes about 60 seconds. The second time it runs in about 6 seconds.
What I know I need advice on is io settings and various buffer settings.
I may also need advice on other things, but just don't know it yet!

Below is ...
- an explain analyze
- details of the db setup and hardware
- some vmstat and iostat output showing the disks are very busy
- the SHOW ALL output for the db config.

Details
=======
Postgres 8.0.3

Below is a sample query. (This is actually implemented as a prepared statement. Here I fill in the '?'s with actual
values.)
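
For context, the prepared form looks roughly like this (a sketch from memory -- the statement name here is made up, and
the parameter order in our code may differ):

PREPARE recent_data (int, int, int, int, int, int) AS
  SELECT datavalue, logfielddatatype, timestamp FROM logdata_recent
  WHERE (logfielddatatype = $1 OR logfielddatatype = $2 OR logfielddatatype = $3)
  AND graphtargetlog = $4
  AND timestamp >= $5 AND timestamp <= $6
  ORDER BY timestamp;

EXECUTE recent_data (70, 71, 69, 1327, 1123052400, 1123138800);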

electric=# EXPLAIN ANALYZE
electric-# SELECT datavalue, logfielddatatype, timestamp FROM logdata_recent
electric-# WHERE (logfielddatatype = 70 OR logfielddatatype = 71 OR logfielddatatype = 69)
electric-# AND graphtargetlog = 1327
electric-# AND timestamp >= 1123052400 AND timestamp <= 1123138800
electric-# ORDER BY timestamp;

                                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=82.48..82.50 rows=6 width=14) (actual time=60208.968..60211.232 rows=2625 loops=1)
   Sort Key: public.logdata_recent."timestamp"
   ->  Result  (cost=0.00..82.41 rows=6 width=14) (actual time=52.483..60200.868 rows=2625 loops=1)
         ->  Append  (cost=0.00..82.41 rows=6 width=14) (actual time=52.476..60189.929 rows=2625 loops=1)
               ->  Seq Scan on logdata_recent  (cost=0.00..46.25 rows=1 width=14) (actual time=0.003..0.003 rows=0 loops=1)
                     Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
               ->  Index Scan using logdata_recent_1123085306_ix_t_fld_gtl, logdata_recent_1123085306_ix_t_fld_gtl, logdata_recent_1123085306_ix_t_fld_gtl on logdata_recent_stale logdata_recent  (cost=0.00..18.08 rows=3 width=14) (actual time=52.465..60181.624 rows=2625 loops=1)
                     Index Cond: ((("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 70) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 71) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 69) AND (graphtargetlog = 1327)))
                     Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
               ->  Index Scan using logdata_recent_1123139634_ix_t_fld_gtl, logdata_recent_1123139634_ix_t_fld_gtl, logdata_recent_1123139634_ix_t_fld_gtl on logdata_recent_active logdata_recent  (cost=0.00..18.08 rows=2 width=14) (actual time=0.178..0.178 rows=0 loops=1)
                     Index Cond: ((("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 70) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 71) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 69) AND (graphtargetlog = 1327)))
                     Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
 Total runtime: 60214.545 ms
(13 rows)

60 seconds is much longer than it used to be. I would guess it used to be under 10 seconds. The second time the above
query is run we see the magic of caching as the time goes down to 6 seconds.

logdata_recent_active and logdata_recent_stale are inherited tables of logdata_recent, which never has any data. (This
is pseudo-partitioning in action!)
So the very quick seq_scan on the empty logdata_recent parent table is okay with me.

The index is built on timestamp, logfielddatatype, graphtargetlog. I am curious as to why the same index shows up three
times in the "using" clause, but can live without knowing the details as long as it doesn't indicate that something's
wrong.
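
For reference, the layout is roughly like this (a sketch, not the exact DDL -- the real tables have more columns, and
the datavalue type here is a guess):

CREATE TABLE logdata_recent (
  "timestamp"      integer,
  logfielddatatype integer,
  graphtargetlog   integer,
  datavalue        real
);

-- child tables hold the actual rows; the parent stays empty
CREATE TABLE logdata_recent_active () INHERITS (logdata_recent);
CREATE TABLE logdata_recent_stale  () INHERITS (logdata_recent);

-- one such index per child table
CREATE INDEX logdata_recent_1123139634_ix_t_fld_gtl
  ON logdata_recent_active ("timestamp", logfielddatatype, graphtargetlog);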

The logdata_recent_stale table has 5 million rows. The size of the table itself, on disk, is 324MB. The size of the
index is 210MB.

The disks are ext3 with ordered journaling, but this was later changed to writeback with no apparent change in speed.

They're on a Dell poweredge 6650 with LSI raid card, setup as follows:
4 disks raid 10 for indexes (145GB) - sdc1
6 disks raid 10 for data (220GB) - sdd1
2 mirrored disks for logs - sdb1

stripe size is 32k
cache policy: cached io (am told the controller has bbu)
write policy: write-back
read policy: readahead

The partition names do what they say ...
[root@rage-db2 /dbdata01]$ df
Filesystem   1K-blocks      Used Available Use% Mounted on
/dev/sdb1     70430588    729324  66123592   2% /dblog01
/dev/sdc1    140861236  19472588 114233300  15% /dbindex01
/dev/sdd1    211299960 157159988  43406548  79% /dbdata01
...

Using iostat (the version from http://linux.inet.hr/) I saw at one point that the data disk was 100% busy.
I believe this was when running the above query, or similar, but in any case the system is always busy with both reads
and (usually) writes.

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0    61    0.1   15.5     0.4   305.7   19.6   0.1    5.8   4.9   8
sdc1      21    22   20.6   17.7   164.6   158.6    8.4   1.1   28.3   6.2  24
sdd1    1742    11 1904.7    6.6 14585.6    71.5    7.7  20.6   10.8   0.5 100

Another time, when I was running the query above, the index partition went to 90+% busy for 40 seconds:

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.2     0.0     0.8    4.0   0.0   20.0  15.0   0
sdc1     366    53  687.0   66.1  4213.1   483.0    6.2  11.8   15.7   1.3  96
sdd1       8    17   16.6   13.9    99.5   125.4    7.4   0.7   23.0   1.9   6

On another occasion (when the query took 24 seconds) I ran vmstat and iostat every 5 seconds
from just before the query until just after. Roughly the first two outputs are from before the query started.
In this case the index disk is maxed.

[root@rage-db2 ~]$ vmstat 5 16
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy wa id
 0  0     92 1233500 225692 9578564    0    0     0     0    1     1  2  2  1  1
 0  1     92 1218460 225748 9595136    0    0  3322    18  655   898  0  0 20 79
 0  1     92 1202124 225780 9616140    0    0  4204    58  920  1291  0  1 24 76
 0  1     92 1172876 225820 9645348    0    0  5847   120 1053  1482  0  1 23 76
 1  0     92 1151712 225836 9666504    0    0  4234     7  847  1239  2  1 18 78
 1  0     92 1140860 225844 9677436    0    0  2153   500  575  2027 13  2 11 73
 1  0     92 1140852 225848 9677636    0    0     0   506  213   442 10  1  0 89

[root@rage-db2 ~]$ /usr/local/bin/iostat -Px 5 16

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       1   243    0.1  105.5     2.7    37.6    0.4   0.0    0.2   0.1   1
sdc1       6   111    3.7   75.9    38.3   769.3   10.1   0.0    0.3   0.2   1
sdd1     255   107   85.2   37.6     4.9   581.0    4.8   0.0    0.1   0.0   1

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sdc1     273     0  414.0    0.4  2747.7     2.4    6.6   1.6    3.9   1.7  69
sdd1       0     1    1.4    0.4     7.2     5.6    7.1   0.0   10.0   6.7   1

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.4     0.0     1.6    4.0   0.0   10.0   5.0   0
sdc1     225     4  777.1    4.6  4011.0    35.1    5.2   2.5    3.2   1.3  99
sdd1       0     2    0.0    2.6     0.0    16.8    6.5   0.0    8.5   0.8   0

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sdc1     508     7  917.8    7.4  5703.0    58.3    6.2   2.2    2.4   1.1  98
sdd1       0     4    0.0    6.8     0.0    44.7    6.6   0.1   15.6   0.6   0

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sdc1     361     0  737.5    0.4  4391.7     2.4    6.0   1.8    2.4   1.0  76
sdd1       0     0    0.0    0.4     0.0     2.4    6.0   0.0    0.0   0.0   0

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0    87    0.0   17.8     0.0   418.3   23.6   0.0    1.3   1.2   2
sdc1     216     2  489.5    0.4  2821.7    11.2    5.8   1.2    2.4   1.1  56
sdd1       2     4    7.2    0.6    37.5    18.4    7.2   0.0    6.2   3.3   3

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0    89    0.0   22.4     0.0   446.3   20.0   0.0    1.1   0.8   2
sdc1       0     4    0.0    1.0     0.0    18.4   18.4   0.0    0.0   0.0   0
sdd1       0     6    0.0    1.0     0.0    27.1   27.2   0.0    0.0   0.0   0

device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sdb1       0    89    0.0   22.5     0.0   446.2   19.8   0.0    0.4   0.3   1
sdc1       0     2    0.0    0.4     0.0     9.6   24.0   0.0    0.0   0.0   0
sdd1       0     4    0.0    0.6     0.0    20.0   33.3   0.0    0.0   0.0   0


Finally, here's a show all:

add_missing_from              | on
archive_command               | unset
australian_timezones          | off
authentication_timeout        | 60
bgwriter_delay                | 200
bgwriter_maxpages             | 100
bgwriter_percent              | 1
block_size                    | 8192
check_function_bodies         | on
checkpoint_segments           | 20
checkpoint_timeout            | 300
checkpoint_warning            | 30
client_encoding               | UNICODE
client_min_messages           | notice
commit_delay                  | 350
commit_siblings               | 5
config_file                   | /dbdata01/pgdata/postgresql.conf
cpu_index_tuple_cost          | 0.001
cpu_operator_cost             | 0.0025
cpu_tuple_cost                | 0.01
custom_variable_classes       | unset
data_directory                | /dbdata01/pgdata
DateStyle                     | ISO, MDY
db_user_namespace             | off
deadlock_timeout              | 1000
debug_pretty_print            | off
debug_print_parse             | off
debug_print_plan              | off
debug_print_rewritten         | off
debug_shared_buffers          | 0
default_statistics_target     | 50
default_tablespace            | unset
default_transaction_isolation | read committed
default_transaction_read_only | off
default_with_oids             | off
dynamic_library_path          | $libdir
effective_cache_size          | 48000
enable_hashagg                | on
enable_hashjoin               | on
enable_indexscan              | on
enable_mergejoin              | on
enable_nestloop               | on
enable_seqscan                | on
enable_sort                   | on
enable_tidscan                | on
explain_pretty_print          | on
external_pid_file             | unset
extra_float_digits            | 0
from_collapse_limit           | 8
fsync                         | on
geqo                          | on
geqo_effort                   | 5
geqo_generations              | 0
geqo_pool_size                | 0
geqo_selection_bias           | 2
geqo_threshold                | 12
hba_file                      | /dbdata01/pgdata/pg_hba.conf
ident_file                    | /dbdata01/pgdata/pg_ident.conf
integer_datetimes             | off
join_collapse_limit           | 8
krb_server_keyfile            | unset
lc_collate                    | en_US.UTF-8
lc_ctype                      | en_US.UTF-8
lc_messages                   | en_US.UTF-8
lc_monetary                   | en_US.UTF-8
lc_numeric                    | en_US.UTF-8
lc_time                       | en_US.UTF-8
listen_addresses              | *
log_connections               | off
log_destination               | stderr
log_directory                 | /dblog01
log_disconnections            | off
log_duration                  | off
log_error_verbosity           | default
log_executor_stats            | off
log_filename                  | postgresql-%Y-%m-%d_%H%M%S.log
log_hostname                  | off
log_line_prefix               | unset
log_min_duration_statement    | -1
log_min_error_statement       | panic
log_min_messages              | notice
log_parser_stats              | off
log_planner_stats             | off
log_rotation_age              | 1440
log_rotation_size             | 10240
log_statement                 | none
log_statement_stats           | off
log_truncate_on_rotation      | off
maintenance_work_mem          | 262144
max_connections               | 40
max_files_per_process         | 1000
max_fsm_pages                 | 100000
max_fsm_relations             | 1000
max_function_args             | 32
max_identifier_length         | 63
max_index_keys                | 32
max_locks_per_transaction     | 64
max_stack_depth               | 2048
password_encryption           | on
port                          | 5432
pre_auth_delay                | 0
preload_libraries             | unset
random_page_cost              | 4
redirect_stderr               | on
regex_flavor                  | advanced
rendezvous_name               | unset
search_path                   | $user,public
server_encoding               | UNICODE
server_version                | 8.0.3
shared_buffers                | 10240
silent_mode                   | off
sql_inheritance               | on
ssl                           | off
statement_timeout             | 0
stats_block_level             | off
stats_command_string          | off
stats_reset_on_server_start   | on
stats_row_level               | on
stats_start_collector         | on
superuser_reserved_connections | 2
syslog_facility               | LOCAL0
syslog_ident                  | postgres
TimeZone                      | PST8PDT
trace_notify                  | off
transaction_isolation         | read committed
transaction_read_only         | off
transform_null_equals         | off
unix_socket_directory         | unset
unix_socket_group             | unset
unix_socket_permissions       | 511
vacuum_cost_delay             | 180
vacuum_cost_limit             | 200
vacuum_cost_page_dirty        | 20
vacuum_cost_page_hit          | 1
vacuum_cost_page_miss         | 10
wal_buffers                   | 8
wal_sync_method               | fdatasync
work_mem                      | 98304
zero_damaged_pages            | off

From: "Jeffrey W. Baker"

On Tue, 2005-08-16 at 10:46 -0700, Roger Hand wrote:
> The disks are ext3 with ordered journaling, but this was later changed to writeback with no apparent change in speed.
>
> They're on a Dell poweredge 6650 with LSI raid card, setup as follows:
> 4 disks raid 10 for indexes (145GB) - sdc1
> 6 disks raid 10 for data (220GB) - sdd1
> 2 mirrored disks for logs - sdb1
>
> stripe size is 32k
> cache policy: cached io (am told the controller has bbu)
> write policy: write-back
> read policy: readahead

I assume you are using Linux 2.6.  Have you considered booting your
machine with elevator=deadline?  You can also change this at runtime
using sysfs.
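
Something like this, assuming a 2.6 kernel recent enough to allow the
switch at runtime (device names and the exact scheduler list will vary
on your box):

# cat /sys/block/sdd/queue/scheduler
noop [anticipatory] deadline cfq
# echo deadline > /sys/block/sdd/queue/scheduler
# echo deadline > /sys/block/sdc/queue/scheduler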

These read speeds are not too impressive.  Perhaps this is a slow
controller.  Alternatively, you might need bigger CPUs.

There are a lot of possibilities, obviously :)  I'd start with the
elevator, since that's easily tested.

-jwb


From: Gavin Sherry

The query plan does *not* look okay.

> electric=# EXPLAIN ANALYZE
> electric-# SELECT datavalue, logfielddatatype, timestamp FROM logdata_recent
> electric-# WHERE (logfielddatatype = 70 OR logfielddatatype = 71 OR logfielddatatype = 69)
> electric-# AND graphtargetlog = 1327
> electric-# AND timestamp >= 1123052400 AND timestamp <= 1123138800
> electric-# ORDER BY timestamp;
>
>                                                                        QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=82.48..82.50 rows=6 width=14) (actual time=60208.968..60211.232 rows=2625 loops=1)
>    Sort Key: public.logdata_recent."timestamp"
>    ->  Result  (cost=0.00..82.41 rows=6 width=14) (actual time=52.483..60200.868 rows=2625 loops=1)
>          ->  Append  (cost=0.00..82.41 rows=6 width=14) (actual time=52.476..60189.929 rows=2625 loops=1)
>                ->  Seq Scan on logdata_recent  (cost=0.00..46.25 rows=1 width=14) (actual time=0.003..0.003 rows=0 loops=1)
>                      Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
>                ->  Index Scan using logdata_recent_1123085306_ix_t_fld_gtl, logdata_recent_1123085306_ix_t_fld_gtl, logdata_recent_1123085306_ix_t_fld_gtl on logdata_recent_stale logdata_recent  (cost=0.00..18.08 rows=3 width=14) (actual time=52.465..60181.624 rows=2625 loops=1)

Notice here that the expected row count is 3, but the actual row count is a hell
of a lot higher. Try increasing the statistics targets for the columns on which
logdata_recent_1123085306_ix_t_fld_gtl is declared.
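
Something along these lines (the column list and the target of 200 are
just a starting point -- repeat for logdata_recent_active too):

ALTER TABLE logdata_recent_stale ALTER COLUMN "timestamp" SET STATISTICS 200;
ALTER TABLE logdata_recent_stale ALTER COLUMN logfielddatatype SET STATISTICS 200;
ALTER TABLE logdata_recent_stale ALTER COLUMN graphtargetlog SET STATISTICS 200;
ANALYZE logdata_recent_stale;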

Also, the actual index scan is taking a long time. How recently have you
vacuum full'd?

Thanks,

Gavin

From: "Roger Hand"

Jeffrey W. Baker wrote:
> On Tue, 2005-08-16 at 10:46 -0700, Roger Hand wrote:
>> The disks are ext3 with ordered journaling, but this was later changed to writeback with no apparent change in speed.
>>
>> They're on a Dell poweredge 6650 with LSI raid card, setup as follows:
>> 4 disks raid 10 for indexes (145GB) - sdc1
>> 6 disks raid 10 for data (220GB) - sdd1
>> 2 mirrored disks for logs - sdb1
>>
>> stripe size is 32k
>> cache policy: cached io (am told the controller has bbu)
>> write policy: write-back
>> read policy: readahead
>
> I assume you are using Linux 2.6.

Oops, sorry I left that out. Nope, we're on 2.4:

[root@rage-db2 ~]$ uname -a
Linux xxx.xxx.xxx 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005 i686 i686 i386 GNU/Linux

It's RedHat Enterprise AS3.0 Fri Nov  5 17:55:14 PST 2004

> Have you considered booting your
> machine with elevator=deadline?

I just did a little Googling and see that the 2.4 kernel didn't have a decent elevator tuning system, and that was
fixed in 2.6. Hmmm ....

Thanks for the ideas ...

-Roger

From: Dan Harris

On Aug 19, 2005, at 12:55 AM, Jeffrey W. Baker wrote:

> On Tue, 2005-08-16 at 10:46 -0700, Roger Hand wrote:
>   Have you considered booting your
> machine with elevator=deadline?

Although I'm not the OP for this problem, I thought I'd try it out.
WOW.. this should be in a Pg tuning guide somewhere. I added this to
my server tonight just for kicks and saw a pronounced improvement in
IO performance.  Thank you very much for mentioning this on the list.

I didn't have a long enough maintenance window to do solid
benchmarking, but I can say for certain that the change was
noticeable, especially in VACUUM operations.
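
For anyone else wanting to try it, I just appended the option to the
kernel line in grub.conf and rebooted (paths and versions here are
illustrative, not copied from my config):

kernel /vmlinuz-2.6.12-3 ro root=/dev/sda1 elevator=deadline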

Specs for the server:

PG 8.0.1
Linux 2.6.12-3 kernel
4xOpteron 2.2
12GB RAM
16-drive RAID 10
XFS mounted with noatime
pg_xlog on separate RAID controller

-Dan


From: Michael Stone

On Sat, Aug 20, 2005 at 12:52:08AM -0600, Dan Harris wrote:
>On Aug 19, 2005, at 12:55 AM, Jeffrey W. Baker wrote:
>>   Have you considered booting your
>>machine with elevator=deadline?
>
>Although I'm not the OP for this problem, I thought I'd try it out.
>WOW.. this should be in a Pg tuning guide somewhere.
[snip]
>16-drive RAID 10

Yeah, the default scheduler tries to optimize disk access patterns for a
single-spindle setup, and actually makes things worse if you have a
device with multiple spindles.

Mike Stone