Thread: strange hot_standby behaviour

strange hot_standby behaviour

From
pfote
Date:
Hi,

I had a very strange effect on the weekend that smells like a bug, so
i'd like so share it.

Setup:
machine A: 16 CPU Cores (modern), 128GB RAM, nice 6-drive SAS Raid-10
machines B, C: 8 Cores (substantially older than A), 48GB Ram, some scsi
Raid, substantially slower than A

The workload is about 80% - 90% SELECTs with heavy sorting and grouping,
the remaining are INSERTs/UPDATEs/DELETEs.
So In the original setup A  is the master, B and C are hot standby's
that process some of the SELECTs, but by far the most processing is done
on the master (A). pg version is 9.0.6. CPU utilization is about 80% on
the master and between 90-100% in the standby's, so it's decided to
upgrade to the latest 9.2 to profit from the latest performance
enhancements.

So B gets upgraded to 9.2.1-1.pgdg60+1 (from pgapt.debian.org) and
becomes master, then A becomes a hot_standby slave that takes all the
SELECTs (and C becomes another hot_standby). In the beginning everything
works as expected, CPU utilization drops from 80% to about 50-60%,
selects run faster, everything looks smoother (some queries drop from
 >5s to <1s due to 9.2s index-only-scan feature). Its friday, everyone
is happy.

About 16 hours later, saturday morning around 6:00, A suddenly goes wild
and has a CPU utilization of 100% without a change in the workload, out
of the blue. Queries that used to take <1s suddenly take 5-10s, "explain
analyze" plans of these queries havn't change a bit though. Switching
the workload off causes the server to become idle. (while I'm writing
this I realize we haven't tried to restart A). Instead, $boss decides to
twitch back to the original setup, so B gets dropped, A becomes master
and gets 100% of the workload (all SELECTs/INSERTs/UPDATEs/DELETEs), and
everything becomes just like friday, CPU usage drops to 50-60%,
everything runs smothly.

I'm not sure yet if this is replication related or a 9.2.1 problem. Any
Ideas?

regards
Andreas Pfotenhauer
Ypsilon.NET AG


Re: strange hot_standby behaviour

From
Lonni J Friedman
Date:
On Mon, Oct 1, 2012 at 7:28 AM, pfote <pfote@ypsilon.net> wrote:
> Hi,
>
> I had a very strange effect on the weekend that smells like a bug, so i'd
> like so share it.
>
> Setup:
> machine A: 16 CPU Cores (modern), 128GB RAM, nice 6-drive SAS Raid-10
> machines B, C: 8 Cores (substantially older than A), 48GB Ram, some scsi
> Raid, substantially slower than A
>
> The workload is about 80% - 90% SELECTs with heavy sorting and grouping, the
> remaining are INSERTs/UPDATEs/DELETEs.
> So In the original setup A  is the master, B and C are hot standby's that
> process some of the SELECTs, but by far the most processing is done on the
> master (A). pg version is 9.0.6. CPU utilization is about 80% on the master
> and between 90-100% in the standby's, so it's decided to upgrade to the
> latest 9.2 to profit from the latest performance enhancements.
>
> So B gets upgraded to 9.2.1-1.pgdg60+1 (from pgapt.debian.org) and becomes
> master, then A becomes a hot_standby slave that takes all the SELECTs (and C
> becomes another hot_standby). In the beginning everything works as expected,
> CPU utilization drops from 80% to about 50-60%, selects run faster,
> everything looks smoother (some queries drop from >5s to <1s due to 9.2s
> index-only-scan feature). Its friday, everyone is happy.
>
> About 16 hours later, saturday morning around 6:00, A suddenly goes wild and
> has a CPU utilization of 100% without a change in the workload, out of the
> blue. Queries that used to take <1s suddenly take 5-10s, "explain analyze"
> plans of these queries havn't change a bit though. Switching the workload
> off causes the server to become idle. (while I'm writing this I realize we
> haven't tried to restart A). Instead, $boss decides to twitch back to the
> original setup, so B gets dropped, A becomes master and gets 100% of the
> workload (all SELECTs/INSERTs/UPDATEs/DELETEs), and everything becomes just
> like friday, CPU usage drops to 50-60%, everything runs smothly.
>
> I'm not sure yet if this is replication related or a 9.2.1 problem. Any
> Ideas?

This could be just about anything.  Which OS are you running?  Did you
check any logs when everything went crazy?


Re: strange hot_standby behaviour

From
Andreas Pfotenhauer
Date:
> This could be just about anything.  Which OS are you running?  Did you
> check any logs when everything went crazy?
Sorry, should have been more verbose.

OS is debian squeeze, pg installed is the latest 9.2.1 from
pgapt.debian.net. Logs where checked, no errors/warnings at all. Query
plans have been checked, no changes before/after. The changed behavior
only manifested in the drop of the number of transactions/second
executed and the massively increased load in the machine. No increased
I/O or anything.

Since A has been promoted to master, everything behaves like expected,
the 9.2 outperformes the 9.0 easily, as expected.


Re: strange hot_standby behaviour

From
Jan Nielsen
Date:
On Mon, Oct 1, 2012 at 11:43 AM, Andreas Pfotenhauer <pfote@ypsilon.net> wrote:
> OS is debian squeeze, pg installed is the latest 9.2.1 from
> pgapt.debian.net. Logs where checked, no errors/warnings at all. Query plans
> have been checked, no changes before/after. The changed behavior only
> manifested in the drop of the number of transactions/second executed and the
> massively increased load in the machine. No increased I/O or anything.
>
> Since A has been promoted to master, everything behaves like expected, the
> 9.2 outperformes the 9.0 easily, as expected.

It's not clear to me from your description if you're using stock PG
Hot Standby but, if you are, there is no support for Hot Standby
between different versions of PG; see the "What Can't You Do With
Binary Replication" section in the Binary Replication Tutorial on the
PG wiki [1].

[1] http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#Hot_Standby