Thread: strange hot_standby behaviour
Hi, I had a very strange effect on the weekend that smells like a bug, so i'd like so share it. Setup: machine A: 16 CPU Cores (modern), 128GB RAM, nice 6-drive SAS Raid-10 machines B, C: 8 Cores (substantially older than A), 48GB Ram, some scsi Raid, substantially slower than A The workload is about 80% - 90% SELECTs with heavy sorting and grouping, the remaining are INSERTs/UPDATEs/DELETEs. So In the original setup A is the master, B and C are hot standby's that process some of the SELECTs, but by far the most processing is done on the master (A). pg version is 9.0.6. CPU utilization is about 80% on the master and between 90-100% in the standby's, so it's decided to upgrade to the latest 9.2 to profit from the latest performance enhancements. So B gets upgraded to 9.2.1-1.pgdg60+1 (from pgapt.debian.org) and becomes master, then A becomes a hot_standby slave that takes all the SELECTs (and C becomes another hot_standby). In the beginning everything works as expected, CPU utilization drops from 80% to about 50-60%, selects run faster, everything looks smoother (some queries drop from >5s to <1s due to 9.2s index-only-scan feature). Its friday, everyone is happy. About 16 hours later, saturday morning around 6:00, A suddenly goes wild and has a CPU utilization of 100% without a change in the workload, out of the blue. Queries that used to take <1s suddenly take 5-10s, "explain analyze" plans of these queries havn't change a bit though. Switching the workload off causes the server to become idle. (while I'm writing this I realize we haven't tried to restart A). Instead, $boss decides to twitch back to the original setup, so B gets dropped, A becomes master and gets 100% of the workload (all SELECTs/INSERTs/UPDATEs/DELETEs), and everything becomes just like friday, CPU usage drops to 50-60%, everything runs smothly. I'm not sure yet if this is replication related or a 9.2.1 problem. Any Ideas? regards Andreas Pfotenhauer Ypsilon.NET AG
On Mon, Oct 1, 2012 at 7:28 AM, pfote <pfote@ypsilon.net> wrote: > Hi, > > I had a very strange effect on the weekend that smells like a bug, so i'd > like so share it. > > Setup: > machine A: 16 CPU Cores (modern), 128GB RAM, nice 6-drive SAS Raid-10 > machines B, C: 8 Cores (substantially older than A), 48GB Ram, some scsi > Raid, substantially slower than A > > The workload is about 80% - 90% SELECTs with heavy sorting and grouping, the > remaining are INSERTs/UPDATEs/DELETEs. > So In the original setup A is the master, B and C are hot standby's that > process some of the SELECTs, but by far the most processing is done on the > master (A). pg version is 9.0.6. CPU utilization is about 80% on the master > and between 90-100% in the standby's, so it's decided to upgrade to the > latest 9.2 to profit from the latest performance enhancements. > > So B gets upgraded to 9.2.1-1.pgdg60+1 (from pgapt.debian.org) and becomes > master, then A becomes a hot_standby slave that takes all the SELECTs (and C > becomes another hot_standby). In the beginning everything works as expected, > CPU utilization drops from 80% to about 50-60%, selects run faster, > everything looks smoother (some queries drop from >5s to <1s due to 9.2s > index-only-scan feature). Its friday, everyone is happy. > > About 16 hours later, saturday morning around 6:00, A suddenly goes wild and > has a CPU utilization of 100% without a change in the workload, out of the > blue. Queries that used to take <1s suddenly take 5-10s, "explain analyze" > plans of these queries havn't change a bit though. Switching the workload > off causes the server to become idle. (while I'm writing this I realize we > haven't tried to restart A). Instead, $boss decides to twitch back to the > original setup, so B gets dropped, A becomes master and gets 100% of the > workload (all SELECTs/INSERTs/UPDATEs/DELETEs), and everything becomes just > like friday, CPU usage drops to 50-60%, everything runs smothly. > > I'm not sure yet if this is replication related or a 9.2.1 problem. Any > Ideas? This could be just about anything. Which OS are you running? Did you check any logs when everything went crazy?
> This could be just about anything. Which OS are you running? Did you > check any logs when everything went crazy? Sorry, should have been more verbose. OS is debian squeeze, pg installed is the latest 9.2.1 from pgapt.debian.net. Logs where checked, no errors/warnings at all. Query plans have been checked, no changes before/after. The changed behavior only manifested in the drop of the number of transactions/second executed and the massively increased load in the machine. No increased I/O or anything. Since A has been promoted to master, everything behaves like expected, the 9.2 outperformes the 9.0 easily, as expected.
On Mon, Oct 1, 2012 at 11:43 AM, Andreas Pfotenhauer <pfote@ypsilon.net> wrote: > OS is debian squeeze, pg installed is the latest 9.2.1 from > pgapt.debian.net. Logs where checked, no errors/warnings at all. Query plans > have been checked, no changes before/after. The changed behavior only > manifested in the drop of the number of transactions/second executed and the > massively increased load in the machine. No increased I/O or anything. > > Since A has been promoted to master, everything behaves like expected, the > 9.2 outperformes the 9.0 easily, as expected. It's not clear to me from your description if you're using stock PG Hot Standby but, if you are, there is no support for Hot Standby between different versions of PG; see the "What Can't You Do With Binary Replication" section in the Binary Replication Tutorial on the PG wiki [1]. [1] http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#Hot_Standby