I have two instances of a production application that uses Postgres 7.2,
deployed in two different data centers for about the last 6 months. The
sizes, schemas, configurations, hardware, and access patterns of the two
databases are nearly identical, but one consistently takes at least 5x
longer than the other for some common operations. During this time, CPU
usage and IO on the slow database are both high (sustained); I'm not
sure about the fast database. These common operations are chatty - at
least tens of thousands of queries over a 5 to 60 minute stretch - but
the queries themselves are fairly simple. The query plans are identical
across both databases, and the data distribution is comparable. The
tables involved in these common operations change frequently and are
indexed to support these queries; the queries use those indexes as
expected. The tables involved have 50k-500k rows.
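To make the plan comparison concrete, it looks roughly like this (the
table and column names are placeholders, not our real schema):

    -- Placeholder query: run it on both databases and compare the output.
    -- EXPLAIN ANALYZE shows the chosen plan plus actual timing and row counts.
    EXPLAIN ANALYZE
    SELECT *
      FROM orders
     WHERE customer_id = 12345;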
We 'vacuum analyze' nightly, and we recently rebuilt the indexes on the
slow database (using reindex table). This cut the number of index pages
dramatically, from ~1800 to ~50, but it didn't noticeably change the
time or CPU utilization for the common operations described above.
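In case it helps, the index page counts can be read straight out of
pg_class (the index name below is a placeholder; relpages is only
refreshed by vacuum/analyze, so it reflects the last maintenance run):

    -- Placeholder index name: substitute the indexes behind the hot queries.
    SELECT relname, relpages, reltuples
      FROM pg_class
     WHERE relname = 'orders_customer_id_idx';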
When running pgbench, both databases produce very similar results
(200-260 over multiple runs with 5 concurrent clients).
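For concreteness, the pgbench runs look roughly like this (the
transaction count and database name are illustrative, not the exact
values we used):

    # 5 concurrent clients, 1000 transactions per client
    pgbench -c 5 -t 1000 mydb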
I know of a few things I can do to make this operation marginally
simpler, but I'm most interested in the difference between the two
databases.
I haven't come up with a theory that explains each of these things.
What are some things I can look into to track this down further?
mike