Hi All,
I'm sure some of you know me from previous questions on other lists,
but this one has myself and Marc completely stumped. We've got a
database that has about 89 Million rows, under PostgreSQL 7.3.3 on a
dual PIII 1.2 with 4 GBytes of RAM on a 5 disk RAID 5 array. The dataset
itself is about 26+ GBYtes in size, all of it in the one table.
To give you some perspective on the size of the dataset and the
performance level we are hitting, here are some "good" results based on
some explains:
jnlstats=# explain analyze select count(*) from some_table where
some_time::date='2003-05-21';
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1794562.35..1794562.35 rows=1 width=0) (actual
time=3013.55..3013.55 rows=1 loops=1)
-> Index Scan using some_table_ix_0 on some_table
(cost=0.00..1793446.02 rows=446531 width=0) (actual time=48.40..2721.26
rows=249837 loops=1)
Index Cond: ((some_time)::date = '2003-05-21'::date)
Total runtime: 3015.02 msec
(4 rows)
jnlstats=# explain analyze select count(*) from stats_raw where
some_time::date='2003-05-21';
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1794562.35..1794562.35 rows=1 width=0) (actual
time=1401.23..1401.23 rows=1 loops=1)
-> Index Scan using some_table_ix_0 on some_table
(cost=0.00..1793446.02 rows=446531 width=0) (actual time=0.50..1118.92
rows=249837 loops=1)
Index Cond: ((some_time)::date = '2003-05-21'::date)
Total runtime: 1401.42 msec
There are about 249837 items that the query is identifying as valid
results and the results range between 1-1.4 seconds over ten runs with
the initial query taking 3 seconds, this average is how 90% of the
queries resopond, but we've got several peaks that we can not explain in
any way. For instance:
jnlstats=# explain analyze select count(*) from some_table where
some_time::date='2003-05-26';
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1794562.35..1794562.35 rows=1 width=0) (actual
time=307025.65..307025.65 rows=1 loops=1)
-> Index Scan using some_table_ix_0 on some_table
(cost=0.00..1793446.02 rows=446531 width=0) (actual
time=51.05..306256.93 rows=374540 loops=1)
Index Cond: ((some_time)::date = '2003-05-26'::date)
Total runtime: 307025.81 msec
(4 rows)
jnlstats=# explain analyze select count(*) from some_table where
some_time::date='2003-05-26';
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1794562.35..1794562.35 rows=1 width=0) (actual
time=10837.86..10837.86 rows=1 loops=1)
-> Index Scan using some_table_ix_0 on some_table
(cost=0.00..1793446.02 rows=446531 width=0) (actual time=1.01..10304.78
rows=374540 loops=1)
Index Cond: ((some_time)::date = '2003-05-26'::date)
Total runtime: 10838.04 msec
The total number of items counted is 374540 items, so not too much more
then the previous query, but the 300 second runtime was unexpected (we
were expecting ~4-5 seconds and then ~1-2 seconds for the caches
results. I have 5 other dates that all exhibit this information,but it's
ONLY those dates that run that slow and the one I presented above here
is the largest of them all. The database server is configured with a 5
MByte shared mamory buffer, but even a larger shared memory buffer does
not help (we have had it set to 800 MBytes before). The disk is getting
hit the heviest durring that last query, with iostat results being:
tty da0 da1 da2
cpu
tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in
id
0 25 12.22 1 0.02 7.68 0 0.00 7.68 0 0.00 0 0 0 0
99
4 3758 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 6 0 38 0
56
0 151 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9 0 43 0
48
0 148 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 10 0 40 0
49
0 153 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 10 0 40 0
49
0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 10 0 40 1
49
0 150 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9 0 42 0
49
0 153 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9 0 41 1
49
0 149 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 8 0 45 0
48
0 148 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 11 0 41 0
48
0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 12 0 38 0
50
0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 14 0 37 0
49
0 152 16.00 20 0.31 8.00 9 0.07 8.00 16 0.12 0 0 1 0
98
0 152 2.00 1 0.00 0.00 0 0.00 1.00 2 0.00 0 0 0 0
100
0 152 11.33 6 0.07 0.00 0 0.00 1.00 2 0.00 0 0 0 0
99
0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 1 0 0 0
99
0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 1 0 0 0
99
0 152 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0 0 0 0
100
0 154 0.00 0 0.00 5.89 18 0.10 0.00 0 0.00 0 0 1 0
98
The database is vacuumed about once every hour, the pg_stats table
shows:
jnlstats=# select tablename, attname, n_distinct from pg_stats where
tablename = 'some_table';
tablename | attname | n_distinct
-----------+------------------+------------
some_table | some_time | -0.24305
So the column is very distinct (assuming that's what a negative number
means). What I'm looking for is any form of explaination that might be
casusing those spikes, but at the same time a possible solution that
might help me bring it down some...
--
Chris Bowlby <excalibur@hub.org>
Hub.Org Networking Services