This is PostgreSQL 8.0.8 on Solaris 8. Yes, I know these are old
technologies, but we can't upgrade yet.
I have two identical servers. One is production; overnight we do a
complete dump and scp the results to the other, which is the standby.
The standby is currently running data that's about 7 months old because
we haven't needed to fail over yet.
I have one specific table ("sweep_users") that has 900,000 rows in prod
and 630,000 on the standby. On the standby a "select count(*) from
sweep_users" takes a couple of seconds. On production it takes... 240
seconds!
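For what it's worth, the table's physical size as the catalogs see it can
be compared on the two servers directly (relpages counts 8 kB blocks and,
like reltuples, is refreshed by VACUUM/ANALYZE, so it should be current
here given our nightly vacuum):

```sql
-- Compare on prod and standby: a much larger relpages for a similar
-- reltuples would point at bloat rather than genuinely more data.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname = 'sweep_users';
```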
Data updates on this table consist of the following meta-logic:

    for host in list_of_hosts:
        delete from sweep_users where hostid = host
        for user in users_for_host:
            insert into sweep_users ....
    vacuum analyze sweep_users
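One thing I realise I should check with this delete/insert pattern: plain
VACUUM only marks the dead rows' space as reusable, and only as much of it
as the free space map (max_fsm_pages) can track; if the FSM is too small,
the table grows anyway. VACUUM VERBOSE reports both, e.g.:

```sql
-- The per-table output shows removed/dead row versions; the summary at
-- the end warns if max_fsm_pages is too small to track all free space.
VACUUM VERBOSE sweep_users;
```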
I'm at a loss to understand why the production server is so slow. While
the query is running, "iostat -x" returns values like (on a striped mirror):
device r/s w/s kr/s kw/s wait actv svc_t %w %b
md30 764.0 0.0 92110.5 0.0 0.0 1.3 1.6 0 87
md30 1666.8 0.0 67254.3 0.0 0.0 1.7 1.0 0 92
md30 844.6 0.4 75716.1 0.3 0.0 1.3 1.6 0 90
The disk on the standby machine only shows 1/20th of that activity.
Now there is a difference in tuning between the two servers: on production
we tried to allocate more memory to the database to make queries more
likely to be served from RAM (but we could have made a mistake)...
% diff prod standby
< effective_cache_size | 262144
---
> effective_cache_size | 1000
92c92
< maintenance_work_mem | 524288
---
> maintenance_work_mem | 16384
106c106
< random_page_cost | 1.5
---
> random_page_cost | 4
113c113
< shared_buffers | 30000
---
> shared_buffers | 1000
141c141
< work_mem | 20480
---
> work_mem | 1024
The only idea I have is that our update pattern is somehow causing
excessive fragmentation (table bloat from the constant deletes and
inserts), either at the DB level or at the OS file level.
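If it is bloat, I gather the usual cure on 8.0 is to rewrite the table in
place; a sketch of what I'd run in a maintenance window (the index name
below is just a placeholder for whatever index exists on hostid):

```sql
-- CLUSTER rewrites the table in index order, compacting it; in 8.0 the
-- syntax is CLUSTER indexname ON tablename. VACUUM FULL also compacts
-- but holds an exclusive lock for the duration. Either way, re-analyze
-- afterwards so the planner sees the new size.
CLUSTER sweep_users_hostid_idx ON sweep_users;
-- or:
VACUUM FULL sweep_users;
ANALYZE sweep_users;
```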
Anyone else have any thoughts?
--
rgds
Stephen