Some very weird behaviour.... - Mailing list pgsql-performance
From | Chris Bowlby |
---|---|
Subject | Some very weird behaviour.... |
Date | |
Msg-id | 1057771777.33712.98.camel@freebsd47 Whole thread Raw |
Responses |
Re: Some very weird behaviour....
|
List | pgsql-performance |
Hi All, I'm sure some of you know me from previous questions on other lists, but this one has myself and Marc completely stumped. We've got a database that has about 89 Million rows, under PostgreSQL 7.3.3 on a dual PIII 1.2 with 4 GBytes of RAM on a 5 disk RAID 5 array. The dataset itself is about 26+ GBYtes in size, all of it in the one table. To give you some perspective on the size of the dataset and the performance level we are hitting, here are some "good" results based on some explains: jnlstats=# explain analyze select count(*) from some_table where some_time::date='2003-05-21'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1794562.35..1794562.35 rows=1 width=0) (actual time=3013.55..3013.55 rows=1 loops=1) -> Index Scan using some_table_ix_0 on some_table (cost=0.00..1793446.02 rows=446531 width=0) (actual time=48.40..2721.26 rows=249837 loops=1) Index Cond: ((some_time)::date = '2003-05-21'::date) Total runtime: 3015.02 msec (4 rows) jnlstats=# explain analyze select count(*) from stats_raw where some_time::date='2003-05-21'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1794562.35..1794562.35 rows=1 width=0) (actual time=1401.23..1401.23 rows=1 loops=1) -> Index Scan using some_table_ix_0 on some_table (cost=0.00..1793446.02 rows=446531 width=0) (actual time=0.50..1118.92 rows=249837 loops=1) Index Cond: ((some_time)::date = '2003-05-21'::date) Total runtime: 1401.42 msec There are about 249837 items that the query is identifying as valid results and the results range between 1-1.4 seconds over ten runs with the initial query taking 3 seconds, this average is how 90% of the queries resopond, but we've got several peaks that we can not explain in any way. For instance: jnlstats=# explain analyze select count(*) from some_table where some_time::date='2003-05-26'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1794562.35..1794562.35 rows=1 width=0) (actual time=307025.65..307025.65 rows=1 loops=1) -> Index Scan using some_table_ix_0 on some_table (cost=0.00..1793446.02 rows=446531 width=0) (actual time=51.05..306256.93 rows=374540 loops=1) Index Cond: ((some_time)::date = '2003-05-26'::date) Total runtime: 307025.81 msec (4 rows) jnlstats=# explain analyze select count(*) from some_table where some_time::date='2003-05-26'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1794562.35..1794562.35 rows=1 width=0) (actual time=10837.86..10837.86 rows=1 loops=1) -> Index Scan using some_table_ix_0 on some_table (cost=0.00..1793446.02 rows=446531 width=0) (actual time=1.01..10304.78 rows=374540 loops=1) Index Cond: ((some_time)::date = '2003-05-26'::date) Total runtime: 10838.04 msec The total number of items counted is 374540 items, so not too much more then the previous query, but the 300 second runtime was unexpected (we were expecting ~4-5 seconds and then ~1-2 seconds for the caches results. I have 5 other dates that all exhibit this information,but it's ONLY those dates that run that slow and the one I presented above here is the largest of them all. The database server is configured with a 5 MByte shared mamory buffer, but even a larger shared memory buffer does not help (we have had it set to 800 MBytes before). The disk is getting hit the heviest durring that last query, with iostat results being: tty da0 da1 da2 cpu tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id 0 25 12.22 1 0.02 7.68 0 0.00 7.68 0 0.00 0 0 0 0 99 4 3758 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 6 0 38 0 56 0 151 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9 0 43 0 48 0 148 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 10 0 40 0 49 0 153 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 10 0 40 0 49 0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 10 0 40 1 49 0 150 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9 0 42 0 49 0 153 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9 0 41 1 49 0 149 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 8 0 45 0 48 0 148 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 11 0 41 0 48 0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 12 0 38 0 50 0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 14 0 37 0 49 0 152 16.00 20 0.31 8.00 9 0.07 8.00 16 0.12 0 0 1 0 98 0 152 2.00 1 0.00 0.00 0 0.00 1.00 2 0.00 0 0 0 0 100 0 152 11.33 6 0.07 0.00 0 0.00 1.00 2 0.00 0 0 0 0 99 0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 1 0 0 0 99 0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 1 0 0 0 99 0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100 0 154 0.00 0 0.00 5.89 18 0.10 0.00 0 0.00 0 0 1 0 98 The database is vacuumed about once every hour, the pg_stats table shows: jnlstats=# select tablename, attname, n_distinct from pg_stats where tablename = 'some_table'; tablename | attname | n_distinct -----------+------------------+------------ some_table | some_time | -0.24305 So the column is very distinct (assuming that's what a negative number means). What I'm looking for is any form of explaination that might be casusing those spikes, but at the same time a possible solution that might help me bring it down some... -- Chris Bowlby <excalibur@hub.org> Hub.Org Networking Services
pgsql-performance by date: