Random performance hit, unknown cause. - Mailing list pgsql-performance

From Brian Fehrle
Subject Random performance hit, unknown cause.
Msg-id 4F8721C4.8090300@consistentstate.com
List pgsql-performance
Hi all,

OS: Linux 2.6.32, 64-bit
PostgreSQL 9.0.5 installed from Ubuntu packages.
8 CPU cores
64 GB system memory
The database cluster is on RAID 10 direct-attached storage, using an HP P800 controller card.


I have a system that has been having occasional performance hits: the load on the system skyrockets, all queries take longer to execute, and a hot standby slave I have set up via streaming replication starts to fall behind. I'm having trouble pinpointing the exact issue.

This morning, during our nightly backup process (where we grab a copy of the data directory), we started having this same issue. The main thing I see in all of these episodes is high disk wait on the system. When we are performing 'well', the %wa from top is usually around 30% and our load is around 12 - 15. This morning we saw a load of 21 - 23 and %wa jumping between 60% and 75%.

The top process pretty much at all times is the WAL sender process; is this normal?
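(For reference, on 9.0 the standby lag can be roughly gauged by comparing the primary's current WAL location against what the standby has received and replayed; the following is just a sketch using the 9.0 function names:)

-- On the primary: current WAL write location
SELECT pg_current_xlog_location();

-- On the standby: last WAL location received from the primary,
-- and the last location actually replayed
SELECT pg_last_xlog_receive_location(),
       pg_last_xlog_replay_location();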

From what I can tell, my access patterns on the database have not changed: the same average number of inserts, updates, and deletes, and nothing on the system has changed in any way. There are no abnormal autovacuum processes beyond the ones that normally run.
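(To be clear about how that can be checked, a sketch using the 9.0 pg_stat_activity column names; autovacuum workers show up there with an 'autovacuum:' query text:)

-- List any autovacuum workers currently running
SELECT procpid, datname, query_start, current_query
FROM pg_stat_activity
WHERE current_query LIKE 'autovacuum:%';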

So what can I do to track down what the issue is? Currently the system has returned to a 'good' state and performance looks great, but I would like to know how to prevent this, as well as be able to grab good stats if it does happen again in the future.
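(As an example of the kind of stats that could be grabbed next time: the checkpoint and background-writer counters in pg_stat_bgwriter, sampled over time, would show whether checkpoint activity lines up with the I/O spikes. A sketch, using columns that exist in 9.0:)

-- Snapshot of checkpoint and background writer counters;
-- sample this every few minutes and diff the values over time
SELECT now() AS sample_time,
       checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,
       buffers_clean,
       buffers_backend,
       buffers_alloc
FROM pg_stat_bgwriter;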

Has anyone had any issues with the HP P800 controller card in a Postgres environment? Is there anything that can help us maximise disk performance in this case, as it seems to be one of our major bottlenecks? I do plan on moving pg_xlog to a separate drive down the road; the cluster is extremely active, so that will help out a ton.

Some I/O stats:

$ iostat -d -x 5 3
Device:        rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
dev1            1.99    75.24  651.06  438.04 41668.57  8848.18    46.38     0.60    3.68   0.70  76.36
dev2            0.00     0.00  653.05  513.43 41668.57  8848.18    43.31     2.18    4.78   0.65  76.35

Device:        rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
dev1            0.00    35.20  676.20  292.00 35105.60  5688.00    42.13    67.76   70.73   1.03 100.00
dev2            0.00     0.00  671.80  295.40 35273.60  4843.20    41.48    73.41   76.62   1.03 100.00

Device:        rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
dev1            1.20    40.80  865.40  424.80 51355.20  8231.00    46.18    37.87   29.22   0.77  99.80
dev2            0.00     0.00  867.40  465.60 51041.60  8231.00    44.47    38.28   28.58   0.75  99.80


Thanks in advance,
Brian F
